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Virtualizing SOL Server 
with VMware or Hyper-V 


C ompanies using SQL Server have been reluc- 
tant to jump onto the virtualization bandwag- 
on. Long-time observers of server virtualization 
know that virtual machines (VMs) don’t perform 
as well as physical systems and I wanted to know 
if server virtualization for SQL Server is ready for 
prime time yet. So I ran a few tests. 


Virtualization Performance 

In the lab I tested two leading server virtualization 
products: VMware’s ESX Server 3.5 and Windows 
Server 2008's Hyper-V (using the original beta 
code that was shipped with Server 2008). My big 
takeaway: Both platforms are definitely ready for 
production server workloads. 

The hardware platform I used was an HP Pro- 
Liant ML370 Generation 5 (G5), a rack-mounted 
4U server. The ProLiant ML370 G5 has two Ouad- 
Core Intel Xeon processors running at 1.86GHz, 
8GB of RAM, and eight 72GB 15,000 RPM drives 
configured as a RAID array. My SQL Server tests 
used four active SQL Server VMs, each servicing 
40 active connections. The ProLiant ML370 G5 
handled this workload well. Comparing the results 
of the response time obtained running one work- 
load on one VM to the response time of running 
all workloads on all servers, I found that the fully 
loaded workload took about 3.5 times as long to 
run. I found this to be a surprisingly good result be- 
cause the overall workload of the physical machine 
was 40 times higher. I performed this test with no 
tuning, and I didn't add any network cards. But 
even without optimization, there's no doubt that 
virtualization can drive up server utilization rates 
and derive more value from today's high-powered 
multi-core server systems. 

Response times for the various tests were consis- 
tent with each other, which should help you assess 
the service level agreement (SLA) you could meet 
running virtualized SOL Server systems. Although 
my tests were below the transaction level of a large 
enterprise database on a dedicated physical server, 
the tests definitely show that virtualized SOL Serv- 
er systems are able to support server consolidation 
scenarios for small and midsized businesses. You 
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can read more about the VMware ESX Server and 
Microsoft Hyper-V comparison in “Virtualization 


Shootout: Hyper-V vs. ESX Server 3.5,” page 41, | 


InstantDoc ID 99218. 


Challenges of Moving 

to aVirtual Platform 

So what challenges do you face when moving SQL 
Server to a virtual platform? Actually moving the 
server system is the first challenge. To do the job 
you can use a physical-to-virtual (P2V) tool such as 
the VMware Converter for VMware implementa- 
tions (www.vmware.com/download/converter/). Or 
you can use the Microsoft System Center Virtual 
Machine Manager 2007 (SCVMM) for Microsoft 
VM implementations such as Virtual Server (www 
.microsoft.com/systemcenter/scvmm/default 
.mspx). You could also try a third-party P2V 


conversion tool, such as Platespin PowerConvert 
(www.platespin.com/products/powerconvert/) 


or Vizioncore vConverter (www.vizioncore.com/ 


vConverter.html). Or you can just build a new VM, 


install SQL Server, then detach and re-attach your 
databases. 

When you're considering server consolidation, 
another challenge is network bandwidth. Although 
a single server might support all its connections us- 
ing a single network card, that's not likely to be the 
case when you use virtualization to consolidate 
servers. You can use System Monitor to ascertain 
the correct number of NICs. 

You might see licensing as a challenge. Did you 
know that the license for SQL Server 2005 Enter- 
prise Edition allows an unlimited number of vir- 
tual SQL Server instances? You can use any type 
of virtualization technology under this license, 
including VMS from VMware. 


Ready for Prime Time? 

My tests show that virtualization technology is ready 
for SQL Server. Today's virtualization platforms can 
support very demanding workloads. It doesn't take 
a crystal ball to see that in the next couple of years 
many SQL Server systems will be going virtual. EI 


InstantDoc ID 99281 


Michael Otey 


(mikeo @ windowsitpro.com) is technical 
director for Windows IT Pro and SQL Server 
Magazine and author of Microsoft SQL Server 
2008 New Features (Osborne/McGraw-Hill). 


July 2008 7 


Raul Pubill 


Nifty Ways to Use 


FOR XML PATH Concatenation 


| Inside Microsoft SOL Server 2005: T-SOL Pro- 
n gramming (Microsoft Press, 2006) by Itzik Ben- 
Gan, Dejan Sarka, and Roger Wolter, I was delighted 
to read about a clever way of using SQL Server 2005's 
FOR XML PATH option to concatenate string values 


LISTING I: Traditional Query 
to List Indexes and Columns 


SELECT table name = OBJECT_NAME(i.[object_id]), 
ic.index id, index name = i.name, 
C.column id, column name = c.name, ic.key ordinal 

FROM sys.index columns ic 

JOIN sys.columns c ON ic. [object id] = c. [object id] 
AND ic.column id = c.column id 

JOIN sys.indexes i ON i. [object id] = ic. [object id] 
AND i.index id = ic.index id 

WHERE c. [object id] = OBJECT_ID(*SalesLT.Address”) 

ORDER BY i.index id, ic.key ordinal 


LISTING 2: Using FOR XML PATH 
to List Indexes’ Columns in a 
Concatenated String 


SELECT ‘,’ + c.name AS [textO] 

FROM sys.index columns ic 

JOIN sys.columns c ON ic. [object id] = c. [object id] 
AND ic.column id = c.column id 

WHERE ic.[object id] = OBJECT ID(“SalesLT .Address') 

ORDER BY ic.key ordinal 

FOR XML PATH(“') 


table name index id |index name 
Address PK Address AddressID 
Address Ak- Aadress -rowguid 
dd 
B is Cin Salacos 
Address (Code CountryRegion 
S = 
Line2_City_StateProvince_ 
Address (Code CountryRegion 
E S = 
ssLine2 City StateProvince | 
Address (Code CountryRegion 
L S = 
ssLine2_City_StateProvince_ 
Address Code_CountryRegion 
S = 
Line2 City StateProvince 
Address (Code CountryRegion 
S = 
Line2_City_StateProvince_ 
Address (Code CountryRegion 
Address IX Address StateProvince 
Figure I 


Results from a traditional guery 


XML F52E2B61-18A1-11d1-B195-99895F49916B 


in rows into a single 
column or result set 
without using tem- 
porary tables or table 
variables (see page 
109). This unorthodox 
method of concat- 
enating string values 
can be guite useful in 
gueries. For example, 
you can use it to list all 
columns in an index or 
all files in a file group. 

Listing 1 shows a 
traditional query to 
list all the columns in 
the indexes for a table 
or view. In the results 
in Figure 1, notice 


column name 


¡AddressID 


oOo po | 
so | po | 
a 
ee ae 
k | 
eee ee 
ee ee 
pbs b 
CO po | 


CountryRegion 


StateProvince 


„AddressID, rowguid,AddressLinel, StateProvince,AddressLine2, 
City, StateProvince, PostalCode, CountryRegion 


Figure 2 
Results from using FOR XML PATH 
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ORE on the WEB 


Download the code at 
InstantDoc ID 99197. 


that the index IX Address AddressLinel Address 
Line2 City StateProvince PostalCode Country 
Region (index id 3) has six columns: AddressLinel, 
AddressLine2, City, StateProvince, PostalCode, and 
CountryRegion. 

You can use the FOR XML PATH option and 
the text() XPath node test (see SOL Server 2005 Books 
Online—BOL at msdn.microsoft.com/en-us/library/ 


ms345115.aspx) to list all the indexes’ columns in a single 


concatenated string, as shown in Listing 2. Figure 2 
shows the results from this guery. As you can see, a 
comma is prepended to each column name. Granted, 
having all the column names listed together like this is of 
limited use. However, you can refine this technigue so that 
it produces useable results. By using the handy STUFF 
string function with the FOR XML PATH option, you 
can get output like that in Figure 3, page 10. | 

In Figure 3, notice the row for the multi- 
column index IX Address AddressLinel Address 
Line2 City StateProvince PostalCode Country 
Region. The AddressLinel, Address- 
Line2, City, StateProvince, Postal- 
Code, and CountryRegion columns 
are all now in this row, with their 
names separated by a comma. Listing 
3, page 10, shows the guery that was 
used to produce this result. In this 
query, the STUFF function removes 
the comma preceding the first name in 
the comma-separated column list. 

The query in Listing 3 is only 
meant to demonstrate how you can 
concatenate the names of columns in 
multicolumn indexes. A query that you 
can use on your databases to list all 
the columns in every index is available 
online in the file named FullQuery 
sql. You can download this file by 
going to www.sąlmag.com, entering 
99197 in the InstantDoc ID text 
box, clicking Go, then clicking the 99197.zip hotlink. 

You can use the FOR XML PATH option with the 
STUFF function for other tasks. For example, you can 
use this technique to concatenate the names of files in a 
file group, as shown in Listing 4, page 10. In addition to 


key_ordinal 
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Share your SQL Server 
code, comments, 
discoveries, and solutions 
to problems. Email your 
contributions to r2r (D) 
sąlmag.com. Please 
include your full name 
and phone number. 
We edit submissions 
for style, grammar, and 
length. If we print your 
submission, you'll 
get $100. 

— Karen Bemowski, 

senior editor 


LISTING 3: Using FOR XML PATH with the STUFF 
Function to Concatenate the Names of Columns in 
Multicolumn Indexes 


SELECT ic2.[object id], index name = i.name, ic2.index id, STUFF( 
(SELECT ‘,’ + c.name AS [text()] 
FROM sys.index columns ic 
JOIN sys.columns c ON ic.[object id] = c. [object id] 
AND ic.column id = c.column_id 
WHERE ic.[object id] = ic2. [object id] 
AND ic.index id = ic2.index id 
ORDER BY ic.key ordinal 
FOR XML PATH(‘’) 
No als the DAS column list 
FROM sys.index_columns ic2 
JOIN sys.indexes i ON ic2.[object_id] = i.[object_id] 
AND ic2.index_id = i.index_id 
WHERE ic2.[object_id] = OBJECT_ID(‘SalesLT.Address’) 
GROUP BY ic2.[object_id], i.name, ic2.index_id; 


FOR XML PATH option 
and STUFF function, this 
query uses a derived table and 
the dynamic management 
view (DMV) dm_io_virtual_ 
file_stats to summarize sta- 
tistics for all the databases 
and all the files given that 
NULL, NULL parameters 
are passed to dm_io_virtual_ 
file stats. The DMV pro- 
vides detailed information 
for each file in the database. 
The query in Listing 4 groups 

the results by database, 


as Figure 4 shows. This 


can be handy for looking 


at databases from a high- 
level perspective so that, 


for example, you can 


object_id index_name index_id column_list 

37575172 PK Address AddressID 1 'AddressID 

37575172 AK_Address_rowguid 2 rowguid 

37575172 IX Address AddressLine1 3 AddressLine1,AddressLine2, 
AddressLine2 City StateProvince City,StateProvince, 
PostalCode_CountryRegion PostalCode,CountryRegion 

37575172 IX Address StateProvince 4 StateProvince 

Figure 3 


Results from using FOR XML PATH with the STUFF function to concatenate 
column names 


LISTING 4: Using FOR XML PATH with the 
STUFF Function to Concatenate the Names 
of Files in a File Group 


SELECT 
vfs.database id, 
DBName = DB NAME(vfs.database id), 
DB Reads = SUM(vfs.num of reads), 
DB Writes = SUM(vfs.num of writes), 
DB BytesRead = SUM(num of bytes read), 
DB IOStallReadMS = SUM(io stall read ms), 
DB BytesWritten = SUM(num of bytes written), 
DB IOStallWriteMS = SUM(io stall write ms), 
DB IOStall = SUM(io stall), 
DB BytesSize = SUM(size on disk bytes), 
DB Files = dt.file_list 
FROM sys.dm io virtual file stats(NULL, NULL) vfs 
JOIN ( 
SELEG 
vfs2.database_id, 
STUFF(( 
SELECT ‘,’ + mf3.name AS [text()] 
FROM sys.dm io virtual file stats(NULL, NULL) vfs3 
JOIN sys.master files mf3 ON mf3.database id = vfs3.database id 
AND mf3.file id = vfs3.file id 
WHERE vfs2.database id = vfs3.database id 
FOR XML PATH(‘’) 
Da Lo a AS IEA 
FROM sys.dm io virtual file stats(NULL, NULL) vfs2 
JOIN sys.master files mf2 ON mf2.database id = vfs2.database id 
GROUP BY vfs2.database id, DB NAME(vfs2.database id) 
) dt ON dt.database id = vfs.database id 
GROUP BY vfs.database id, DB NAME(vfs.database id), dt.file Tist 
ORDER BY DB NAME(vfs.database id) 


easily determine which 
database has the highest 
reads. You'd just need 
to sort the results by 
descending DB Reads. 
Note that no tem- 
porary tables or table 
variables were needed to 
perform these nifty con- 
catenation operations. 
As Inside Microsoft SOL 
Server 2005: T-SOL 
Programming mentions, 
not only is the resulting 
query “short and slick” 
but it is also high per- 
forming. With these 
examples as a starting 
point, can you think of 
other creative ways to 
use FOR XML PATH 
to display concatenated 
string results? SQL] 
—Raul Pubill, 
senior database 
administrator, 
JP Morgan FCS 
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DataBase_id DBName DB_Reads DB_Writes DB_BytesRead DB IOStallReį DB BytesWritten DB IOStallWriteMS DB IOStall DB BytesSize DB Files 
AdventureWorksLT Data, 
9| AdventureWorksLT 186 9 1835008 1048 36864 8 1056 40894464 | AdventureworksLT_Log 
1|master 147 25 3051520 4176 200704 168 4344 6815744| master,mastlog 
3|model 45 5 1794048 506 28672 20 526 1769472| modeldev,modellog 
4|msdb 71 0 942080 629 0 0 629 10551296| MSDBData,MSDBLog 
2|tempdb 55 7 786432 248 47104 10 258 2621440|tempdev,templog 
Figure 4 


Results from using FOR XML PATH with the STUFF function to concatenate filenames 
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Letters to the Editor 


Allopinions expressed in this letter are those of Idera and do not 
reflect the opinions of SOL Server Magazine or Penton Media. 
No responsibility is accepted by Penton Media for the accuracy 
of any statements, opinions, or advice contained in this letter, 
and readers should rely upon their own enquiries when making 


any decisions affecting their own interests. 


A Misfire in the Shootout? 

The April edition of SOL Server Magazine contained a 
comparative review [ “SQL Server Backup Compression 
Shootout,” April 2008, InstantDoc ID 98180] of three 
backup and recovery products for SQL Server from 
HyperBac, Red Gate, and Idera. At Idera, we respect 
and appreciate the integrity of your publication and the 
technical abilities of your review team; however, in this 
particular review, we believe the results to be misleading. 
The Idera SQLsafe backup and recovery product pro- 
vides a powerful feature set addressing performance, 
manageability, and scalability. These features do not 


appear to have been effectively exercised and reviewed 
consistently across the three products. 

Idera SQL safe is a powerful backup and recovery 
system for SQL Server that provides considerably 
more than just backup compression. It is designed to 
automate, manage, and audit backup and recovery 
Operations across tens or hundreds of servers often 
performing millions of backups per year. Had we more 
clearly understood the criteria and goals of the review, 
we would have submitted the free edition of our SQL- 
safe product, which has a footprint and functionality 
set more comparable to Red Gate and HyperBac. 

Given the above, we would ask your readers to 
review our more detailed response in the online edition 
of the article and then download the products and try 
them out for themselves in their own environment, the 
true test of any product. [SQL 

— Heather Sullivan, product manager, Idera 
InstantDoc ID 99158 


Sąlmag.com Community Dialog 
What Do SQL Server Magazine Readers Want? 


Readers Respond: Same Price, 
but Fewer, Longer Articles 

In May 2008, we ran two Instant Polls on the Sqlmag.com 
home page to get some feedback from readers on whether 
they'd pay more for a larger SOL Server Magazine and 
whether they’d prefer fewer, longer articles or more, shorter 
articles in the print version of the magazine. Although 
the polls didn’t get enough responses to make them 
anything more than an interesting sampling of reader 
opinion, they give us some idea of the kind of magazine 
you'd like us to provide. 

First, we put this question to readers: “Would you 
pay more for a 64-page SOL Server Magazine (current 
length is typically 48 pages)?” Out of 61 responses, 
just over half—S2 percent—answered “No, the content 
isn't worth an increase in price.” But slightly less than half— 
47 percent—said they would pay some amount more for 
a SOL Server Magazine with more pages. (Percentages 
don’t add up to 100 because our polling tool rounded 
them off.) You can see the complete poll results at www 
„sąlmag.com, InstantDoc ID 99272. 

We posted a follow-up poll asking readers whether 
they'd prefer a magazine version with fewer articles 
containing more content (e.g., text, code, figures) or one 
with more articles that are shorter, with more links to 
additional content on the Sąlmag.com Web site. In this 
poll, which had fewer responses than the previous one, 
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63 percent of readers said they'd prefer fewer articles. 
This result syncs with comments some of the other 
editors and I have heard from SOL Server Magazine 
readers, who tell us they like having all the content in 
print and would rather not have to go back and forth 
between magazine and Web site to read an article. 

So what's the outcome of these surveys for you, the 
reader? First, although we aren't planning to increase 
the subscription price for SOL Server Magazine 
anytime soon, it's good to get some idea of what our 
readers think about the current rate and whether they 
find the magazine's content valuable. Second, were 
making an effort to include all relevant code, figures, 
and other elements in SQL Mag articles. We want to 
provide as much content as possible in the print issues 
because we know many of you like having all the con- 
tent handy in one magazine. Still, we want our readers 
to discover the wealth of information on Sqlmag.com, 
including our print article archives as well as the thou- 
sands of Web-exclusive articles that are available only 
online. Judging from the growing number of reader 
comments I’ve seen on Sqlmag.com articles, Pd say 
that SOL Mag readers are getting active online, too. 

SQL} 

—Anne Grubb, Web site strategic editor, 
SOL Server Magazine 

InstantDoc ID 99272 
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LISTING I: Verifying and 
Enabling AWE 


-- Checking for AWE: 


EXEC sp configure ‘show advanced options’, 1 


RECONFIGURE 
GO 


-- A config_value of 0 means that AWE is not 


-- configured 


-- A run value of 0 means that AWE is not enabled 
EXEC sp_configure ‘awe enabled’ 


GO 
-- Enabling AWE: 


-- (Make sure to grant “lock pages in memory” to 
-- SQLService account and restart SQL after 


-- this change) 


EXEC sp configure ‘awe enabled’, 1 


RECONFIGURE 
GO 
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Performance», 


SECRETS: 


for 


ne of the primary things that I do as a 
O consultant is to help companies evaluate 

the gap between where their current SQL 
Server deployments are and where they should be 
in relation to industry best practices. By offering a 
number of different auditing services and options, 
T help clients validate code, optimize performance, 
and evaluate their needs for future growth. And 
because many of my audits are holistic in nature, 
I’ve gained great insights over the years into some 
common developer pitfalls and mistakes that can 
adversely affect performance. This article focuses 
on 10 common performance problems (presented 
in no particular order) that developers can avoid 
with a modicum of effort. 


Failure to Properly Configure 
AWE Memory Usage 

Although configuring server memory is obviously an 
IT or DBA responsibility, I’ve seen too many environ- 
ments where “poorly performing code” was actually 
the result of a system with 8GB (or more) of RAM 
restricting SQL Server to only 2GB. In fact, from 
what I’ve seen, I’d wager that this problem is so wide- 
spread that more than half of production SQL Server 
deployments suffer from 
it. Of course, by help- 
ing clients remedy this 
situation, I end up be- 
ing a hero; adding more 
RAM is typically one of 
the easiest ways to boost 
performance, as long as 
SQL Server can use the 
memory. 

Happily, you, too, 
can be a hero by en- 
abling SQL Server to 
use more than 2GB of 


4 


RAM. It's a relatively simple operation—although I 
wish that the SOL Server installer offered to config- 
ure Address Windowing Extensions (AWE) during 
installation. To start, verify that SOL Server is using 
AWE, which you can do by executing the code shown 
in the first part of Listing 1. (Note that some lines in 
the listings are wrapped to fit on the printed page.) 
If you have more than 2GB of physical RAM and 
AWE isn't enabled, you can use the remaining code in 
Listing 1 to guickly configure SOL Server to use the 
additional memory. 

The trick to enabling AWE is to ensure that the ser- 
vice account used by SOL Server has been granted the 
Lock pages in memory user right, as Figure 1 shows. 
Without this critical configuration change, SOL Serv- 
er can't use more than 2GB of RAM, no matter how 
much physical memory is available. Also note that if 
you need to change this configuration, you need to 
restart the SOL Server service before you can realize 
the benefits. (For more information, see the Micro- 
soft article “How to configure SOL Server to use 
more than 2 GB of physical memory“ at support 


„microsoft.com/?kbid=274750, which applies to 


SOL Server 2005 and SOL Server 2000.) 


Failure to Normalize 
Many performance-tuning articles suggest that you 
intentionally denormalize databases for the sake of 
performance (to what I jokingly call 2 % normal 
form). Although internal denormalization is a val- 
1d approach that's covered elsewhere, it's also a sad 
fact that databases are occasionally deployed by 
developers who either don't know anything about 
normalization or who fail to grasp its importance. 
In these cases, the extra business logic required 
to extract useful information from jagged Franken- 
columns of tangled data adds significant processing 
overhead. However, this additional overhead usu- 
ally doesn't become a performance problem until 
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Avoid the 
performancė 


the database becomes heavily used. Of course, by 
that time there's typically no easy or cost-effective 
way to remedy the situation. Therefore, although it 
sounds trite, one of the best things that developers 
can do to ensure good performance is start with a 
solid, normalized foundation. The cost and head- 
ache of doing otherwise can guickly become a true 
performance nightmare. 


Failure to Use 

Clustered Indexes 

When you're looking to squeeze seconds or milli- 
seconds of additional performance out of your 
code, the prospect of shutting down your produc- 
tion database for a few hours to correct a perfor- 
mance problem probably doesn't make much sense. 
But doing so is the only way to remedy common 
performance problems stemming from the lack of 
a clustered index. 

As with normalization problems, failure to use 
clustered indexes is a performance issue that typi- 
cally won't surface until a table starts to get fairly 
large (say, a few million rows). But by that time, 
the table has typically become heavily fragmented 
(both physically and logically, through the use of 
forwarding records), and queries involving book- 
mark operations can become slow enough to draw 
the attention of end users and management. Sadly, 
at this point the only available options are to either 
re-create the table or try to slap a clustered index 
on it. Both operations can take hours (though re- 
creating the table is typically faster), and end users 
can't use the table while the data is being restruc- 
tured—which in most cases renders your database 
useless for lengthy periods. 

Therefore, it's a good rule of thumb to make 
sure that all tables have a clustered index. Likewise, 
given the fact that a well-placed clustered index can 
drastically improve ranged queries, it's also a good 
practice to ensure that any table with more than 
20,000 rows has a well-considered clustered index. 


Failure to Optimize 

If you're interested in improving the performance 
of your code, your efforts will only go so far if your 
database is missing key indexes, has heavy table or 
index fragmentation, or doesn’t have accurate statis- 
tics. Furthermore, since some DBAs are either too 
swamped to optimize index placement (or consider 
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IN ommon 
oblems 


it a developer responsibility), it behooves 
developers to ensure that foreign keys and 
other joined columns are indexed on both 
sides of the join (i.e., in both tables). 

Likewise, for developers whose code is 
being fingered for performance problems, it’s always 
good policy to verify that indexes are being regularly 
rebuilt and defragmented and that statistics are be- 
ing regularly updated. Doing so can help you avoid 
wasted effort. Because I find that “failure to DBA” 
is a common problem in many engagements where 
I’m called in to help tune “poorly running code,” I’ve 
provided two scripts (Web Listings 1 and 2 at www. 
„sąlmag.com, InstantDoc ID 99148) that you can use 
to help evaluate statistics and index fragmentation. 
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Failure to Create 

SARGable Queries 

Highly optimized queries against large tables in SQL 
Server take advantage of index-seek operations in- 
stead of more expensive scanning operations involving 
tables and indexes. (The difference between these op- 
erations is analogous to seeking against the alphabet- 
ized listing of last names in your phone book to find 
every “Campbell” versus scanning the phone book for 
every “Michael”—the difference can be expressed 
in orders of magnitude.) Accordingly, queries that 
can take advantage of index-seek operations are 
commonly referred to as SARGable queries (where 
SARG is a word play on Search ARGument). 


Figure | 


ORE on the WEB 


at 


InstantDoc ID 99148. 


Enabling the Lock pages 


in memory user right 
via the Local Security 
Policy window 
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LISTING 2: Creating a 
SARGable Query 


-- environment setup: 

-- Create an additional index for demonstration 
-- purposes 

USE AdventureWorks 

GO 


CREATE NONCLUSTERED INDEX 

[IX Contact ModifiedDate ] 
ON [Person].[Contact] (ModifiedDate) 
G0 


/* SARGable Queries use the following pattern: 
Column Op Const/Variable 
Where: 

Column is the column to evaluate. 

Op is an operator such as =,>=,<=,>, <, 

and BETWEEN. 

Const/Variable is either a constant or 

a variable. 


LIKE is also SARGable, but only when wildcards 
follow initial text. 
Therefore the following would be valid: 
LIKE ‘J0%’ 
But this would not be valid: ‘%oh%’ 


Highlight each query example and press F5 
to execute, or press CTRL+K to see the 
estimated execution plan. 


*/ 


-- Valid SARGs: 

-- (execute for results, select and press 
-- CTRL+K for plan) 

SELECT CustomerID FROM Sales.Customer 
WHERE AccountNumber = ‘AW@9014413’ 

OR TerritoryID = 5 OR CustomerID = 5008 


-- Slightly more complex examples of valid SARGs: 


SELECT ContactID FROM Person.Contact 
WHERE ModifiedDate > *2005-01-01” 
OR EmailAddress LIKE ‘steve%’ 


-- Non-SARGable WHERE clause examples: 

-- (returns same results as above, 

-- 20x more expensive) 

SELECT ContactID FROM Person.Contact 
WHERE DATEPART(yyyy ,ModifiedDate) >= 2005 
OR EmailAddress LIKE ‘%teve%’ 


-- cleanup 

DROP INDEX [IX Contact ModifiedDate ] 
ON [Person]. [Contact] 

G0 


LISTING 3: Optimizing Cursor 
Declarations 


USE AdventureWorks 


DECLARE @City NVARCHAR (39) 


DECLARE myCursor CURSOR LOCAL READ ONLY 
FAST FORWARD FOR 

SELECT DISTINCT City 

FROM Person.Address 


-- open and use cursor 

OPEN myCursor 

FETCH NEXT FROM myCursor INTO (City 

WHILE GGFETCH STATUS = Ø BEGIN 
PRINT @City 


FETCH NEXT FROM myCursor INTO @City 
END 


-- make sure to close and deallocate: 


CLOSE myCursor 
DEALLOCATE myCursor 
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Creating SARGable 
gueries might sound 
a bit spooky to the 
uninitiated, but once 
you learn a few simple 
syntax rules and ensure 
that a viable index is in 
place, creating SARG- 
able queries actually 
becomes second na- 
ture. Happily, you can 
find a wealth of re- 
sources (both online 
and in print) that cover 
this topic in depth, but 
for a quick overview, 
see Listing 2. 


Wasted 
Network 
Bandwidth 

SQL Server's default be- 
havior is to constantly 
update clients about 
current execution status 
while a query, batch, or 
operation is being pro- 
cessed. Although ben- 
eficial in some cases, this 
chattiness adds addition- 
al network overhead. 
Therefore a good rule of 
thumb is to ensure that 
stored procedures called 
by applications take 
advantage of the SET 
NOCOUNT ON direc- 
tive to turn off status 
messages. Doing so can 
cut back significantly on 
network traffic and, in 
the words of SOL Serv- 
er Books Online (BOL), 
“provide a significant 
performance boost.” 
Web Listing 2 shows an 
example of using SET 
NOCOUNT on. 

In a similar vein, 
developers should also 
avoid SELECT * que- 
ries when returning re- 
sult sets, as these result 
sets typically return 
much more data than 
the calling application 


needs. Therefore, a great way to optimize code is to 
make sure that you're returning only columns that 
you need; failing to do so results in additional net- 
work overhead with no benefits. Likewise, another 
good rule of thumb is to try to always horizon- 
tally restrict returned data to only what's needed 
(through efficient use of the WHERE clause). 


Improper Use of Cursors 

Sadly, cursors are prone to abuse. But as long as 
cursor-based operations aren't causing massive per- 
formance problems (or can't be easily rewritten using 
set-based logic), I typically encourage clients to use 
optimized declaration syntax and ensure that cursors 
are closed and de-allocated as soon as possible. A key 
consideration that makes this recommendation possi- 
ble is the fact that most developers intend to use light- 
weight, single-direction cursors for simple looping 
operations. However, although this might be their 
intention in most cases, developers will end up with 
a much more expensive cursor unless they use very 
explicit syntax. 

The fix for this problem is usually simple: Use 
more verbose syntax (as Listing 3 shows). By ex- 
plicitly declaring lightweight cursors designed for 
use by the current connection only, and by ensur- 
ing that cursors are closed and de-allocated (espe- 
cially when nested) as soon as possible, you can 
overcome many of the problems typically associ- 
ated with cursors without spending a lot of devel- 
opment effort. 


Failure to Use 

Full-Text Indexing 

In cases where SARGable queries aren't possible 
due to a requirement to search for %conditions% 
in the middle text columns, full-text indexing can 
be a lifesaver. Without full-text indexing, the best 
you can hope for is an index scan (which requires 
less I/O than a full table scan). But against millions 
of rows, even an index scan can gobble up excessive 
amounts of CPU resource, which can adversely 
impact systemwide performance. 

By tokenizing text data with a full-text index, 
not only can you drastically decrease the size of 
the indexes used for queries, but SQL Server can 
provide results much closer to those you'd an- 
ticipate with an index seek. The result is that, in 
certain scenarios, full-text indexing can yield que- 
ries that require substantially less processing and 
return results at a fraction of the cost imposed by 
their traditional counterparts. For example, the 
data in Figure 2, page 16, was taken from an envi- 
ronment where more than 8 million rows were be- 
ing pummeled up to 12 times a second by a fairly 
complex query. 
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"Nobody's perfect. 
But SOL Backup makes 
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Without Full-Text Indexing 
Index Size: 533MB 
Query Execution Cost: 27 
Average CPU/Execution: 988 
Elapsed Execution Time: 5-7 seconds 


With Full-Text Indexing 


Index Size: 133MB 

Query Execution Cost: .27 (10M times less expensive) 
Average CPU/Execution: 175 

Elapsed Execution Time: Ø seconds 


Figure 2 Best of all, these improvements can also free 


up existing I/O and CPU resources—which trans- 
lates to better overall system performance. There- 
fore, although full-text indexing can’t solve every 
problem, you should evaluate it as a possible way 
to boost performance if you're doing intensive text 
searches. There’s also a slight performance hit in- 
volved with turning full-text indexing on, so make 
sure you're getting tangible benefits by going this 
route. 


Tangible benefits of 
full-text indexing 


Failure to Use 
Multiple Data Files 
A key performance enhancement that many orga- 
nizations fail to take advantage of is SQL Server’s 
support for multiple data files. In scenarios in 
which a database is larger than the amount of 
physical memory available, consider using multiple 
data files, especially if the server in question has 
multiple processors and one or more RAID con- 
trollers. Furthermore, the benefits of using mul- 
tiple data files are even more pronounced in cases 
where most of the database’s data is housed in a 
handful of heavily used tables. 
Finding the right number of data 
files to use can require some testing, 
but as long as you don’t go crazy 
adding secondary data (.ndf) 
files, it’s hard to go wrong. 
In fact, Microsoft recom- 
mends a ratio of roughly 
0.25 to 1 data file per 
file group per proces- 
sor. Why? Because add- 
ing more files lets SQL 
Server take advantage 
of parallel processing 
by delegating expensive 
read operations to dedi- 
cated I/O threads. The more 
threads—up to a point—the 
better. Therefore, moving a 
heavily used table to a dedicated 
file group, and its nonclustered in- 
dexes to another dedicated file group, can 
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provide serious performance benefits—especially 
if you can place the files for these file groups on 
different physical disk arrays or spindles. 


Failure to Properly 

Size Data Types 

Failure to properly size data types can result in cost- 
ly performance problems that can be difficult to di- 
agnose and correct. This problem partly stems from 
the fact that many developers don’t realize that as far 
as SQL Server is concerned, a VARCHAR(20) is a 
different data type than a VARCHAR(30). Throw- 
ing NVARCHAR versus VARCHAR and NULL 
versus NOT NULL into the mix only complicates 
things further. Masking the problem, however, is the 
fact that SQL Server typically handles subtle differ- 
ences by transparently converting one data type to 
another in most cases. 

Where this can cause problems, however, is when 
coercion causes SQL Server to choose an index or 
table scan instead of a much better-performing in- 
dex seek. On smaller tables, these effects are typi- 
cally negligible; however, as tables begin to grow, 
the effect can become much more pronounced. 
In some cases, this means that coercion can be a 
“silent thief” that slows moderately sized queries 
to a degree that usually goes unnoticed. How- 
ever, in extreme cases, I’ve seen examples where a 
stored procedure declared an NVARCHAR(50) 
parameter and used it ina WHERE clause against 
a VARCHAR(50) column, which resulted in an 
index scan that was making the query take four 
seconds to execute. By simply changing the param- 
eter to match the table’s data type, SQL Server was 
able to revert to an index seek that completed in 
milliseconds. Therefore, although the frequency 
of performance issues caused by this problem is 
typically very low, it’s always a good idea to pay 
close attention to data-type declarations and siz- 
ing when coding to avoid having your code become 
a sad statistic in the war on optimizing database 
performance. 


Performance 
Considerations 
Optimizing SQL Server performance is a vast topic 
that you can approach from many different angles. 
This article focuses on a smattering of common 
problems that are typically fairly easy to remedy. My 
hunch is that most SQL Server deployments suffer 
from a handful of issues outlined in this article—but 
it’s my hope that this article has also gotten you to 
think a bit more about ways to tackle performance 
issues while you code, rather than after the fact. 
SQL] 
InstantDoc ID 99148 
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Use a new data type 


HierarchyID 


to maintain and query hierarchial data 


Server 2008 introduces the 
HIERARCHYID data type, 
which is designed to support 


storing and manipulating data representing a hier- 
archy. Examples of hierarchies that you might need to 
represent in a SQL Server database include employee 
organizational charts, folder lists, forum discussions, and 
so on. This article is the first in a two-part series about 
the new HIERARCHYID data type. This month I'll 
introduce you to the new data type, and I'll explain how 
to define and index the table that will hold the hierarchy 
data, how to insert new nodes into the table, and how 
to query the hierarchy. Next month I'll explain how to 
handle changes in the hierarchy (e.g., reparenting nodes) 
and how to convert a traditional parent-child hierarchy 
(aka an adjacency list) to a representation with the new 
HIERARCHYID data type. 


Creating and Populating 

the Hierarchy 

The HIERARCHYID data type provides the means 
to store and manipulate data representing a hierarchy. 
HIERARCHYID is a system-supplied CLR type with 
a set of methods that allow manipulation of the type. 
The methods supported by the type include GetLevel, 
GetRoot, GetDescendant, ToString, Parse, IsDescen- 
dantOf, GetAncestor, GetReparented Value, Read, and 
Write. Pll describe some of the methods this month 
and some next month. 

The type internally holds a binary value as large as 
892 bytes that represents the path of ancestors leading 
to the current node. The value positions the current 
node among other nodes in the hierarchy—both in 
terms of parent-child positioning, and in terms of 
positioning among siblings. For example, if you use the 
HIERARCHYID data type to represent a hierarchy 
of employees, the value holds the path of all managers 
leading to the current node; the value represents the 
correct manager-subordinate relationships, and even 
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the correct position among direct subordinates of the 
same manager, in case there's a need for it. 

The HIERARCHYID type provides topological 
sorting—a term taken from graph theory—meaning 
that the value of a node is greater than the values of 
all ancestors of the node and less than the values of 
all descendants of the node. The fact that the type pro- 
vides topological sorting means that when sorting by 
HIERARCHYID values, a child would always appear 
in the output after the parent. 

Pll use a hierarchy of employees in my examples. Run 
the code in Web Listing 1 (www.sqlmag.com, InstantDoc 
ID 99036) to create the Employees table, as well as 
indexes on the table to support querying the hierarchy. 
The hid column in the Employees table (see Web Table 1) 
is of a HIERARCHYID data type, and it represents the 
position of the current employee in the hierarchy. Notice 
the computed persisted column called lvl that is defined 
by a call to the GetLevel method of the hid column. 
The GetLevel method returns the level of the node in 
the hierarchy, starting with 0 for the root, 1 for the direct 
subordinates of the root, and so on. 

Besides creating the Employees table, 
Web Listing 1 also contains statements that 
create two indexes to support querying the 
hierarchy; the first is a unique clustered index 
created on the hid column and the second is 
a unique nonclustered index created on the lvl and hid 
columns. These indexes implement two indexing strate- 
gies called depth-first and breadth-first, respectively. 
The depth-first index organizes all members of a sub- 
tree (manager and all subordinates in all levels) close to 
each other; therefore it will support requests to return 
a whole subtree. This index will also support requests 
to sort the hierarchy by the hid column (in topological 
order). The breadth-first index organizes all members 
of the same level close to each other; therefore it will 
support requests for siblings from the same level (e.g., 
a request for all direct subordinates of a manager). 
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HIERARCHYID 


Later in the article, Pll demonstrate such requests and 
the way they are optimized. 

To insert a new row into the Employees table rep- 
resenting a new employee in the hierarchy, you need to 
generate a HIERARCHYID value for the hid column. 
The value should represent the correct position of 
the employee in the hierarchy. You can use methods 
of the type to help in this task. For the root of the 
hierarchy—the big boss or the CEO—you can use the 
HIERARCHYID::GetRoot static method. This method 
simply produces a HIERARCHYID value that is inter- 
nally an empty binary string representing the root of the 
tree. If the new node you are adding is not the root, you 
can produce the HIERARCHYID value for the new 
node by invoking the method GetDescendant on the 
hid value of the manager of the new node. This method 
accepts two HIERARCHYID inputs; Pll call them 
@lft and @rgt. If both values are not NULL, the 


THE LOGICAL PUZZLE 


Solution to June’s Puzzle: 
Josephus Problem 

The Josephus problem is an ancient puzzle that 
involves a group of 41 men standing in a circle. 
Going around the circle, every second standing man 
is executed (one skipped, one executed) until only 
one man is left standing. Assuming that the positions 
are numbered 1 through 41, which position should 
Josephus (one of the men) choose if he could, so 
that he would be the only one to remain standing? 
Can you generalize the solution for n men? Write a 
T-SQL solution that returns the position based on the 
input number of men @n. 

An easy way to find a generic solution to this 
puzzle with any number of men is to first solve it 
with very small numbers of men (1, 2, 3, and so on), 
and to look for a pattern in the results. If you solve 
the puzzle for small numbers, you get the results 
shown in Web Table A (www.sqlmag.com, InstantDoc 
ID 99039), where n is the number of men, and p is 
the position of the only man left. 

The pattern you can identify is that p is an increas- 
ing sequence of odd integers that restarts from 1 
when n is a power of 2. You express n as 24a + b, 
where b >= O and b < 24a. That is, a is the highest 
power of 2 such that 24a is smaller than n, and b is n 
minus 24a. Then, p can be expressed as 2b + 1. For 
example, for n = 41, you express n as 245 + 9. Since 
b=9 and p=2b+ 1, you get p= 19. 

Of course, this is just an observation of a pattern 
based on the cases that were tested. To ensure that 
the pattern holds for all cases, you need a mathemat- 
ical proof. You can find such a proof at en.wikipedia 
.org/wiki/Josephus_problem. The following T-SQL 
expression calculates p for a given @n: 


DECLARE @n AS INT; 
SET @n = 41; 
SELECT 2 * (@n - POWER(2, CAST(LOG(@n)/LOG(2) AS 
INT))) + 1 AS p; 
InstantDoc ID 99039 
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method produces a new 
value positioned between 
the two. If both are 
NULL, the method pro- 
duces a value simply below 
the manager at hand. If 
@lft is not NULL and 
@rgtis NULL, themethod 
produces a value greater 
than @lft. Finally, if 
@lft is NULL and @rgt 
isnot NULL, the method 
produces a value smaller 
than @rgt. 

Run the code in Web 
Listing 2 to create the usp_ 
AddEmp procedure that is 
in charge of adding a new 
node into the hierarchy. If 
the new employee is the 
root employee, the proce- 
dureuses the HIERAR- 
CHYID::GetRoot static 
method to produce the 
HIERARCHYID value 
for the new employee. If 
the new employee is not 
the root, the procedure 
first obtains the hid value 
of the new employee’s 
manager, as well as the 
maximum hid value 
among the direct subordi- 
nates of that manager. The 
procedure then uses the 
GetDescendant method 
to produce the hid value 
of the new employee as 


the last node under the given manager. 

To populate the Employees table with sample data, 
run the code in Web Listing 3. As long as the hierarchy 
doesn’t contain a large number of levels, the HIERAR- 
CHYID values are quite economical. 

The HIERARCHYID data type also exposes a 
method called ToString that produces a canonical 
string representation of the hierarchical value as a 
path using a slash as a separator between levels. Unlike 
the raw binary representation of a HIERARCHYID 
value, the string representation produced by ToString 
is more meaningful to the naked eye. For example, the 
ToString method returns the string ‘/2/1/1/3/ for the 
HIERARCHYID value 0x6AD6F0. You can easily 
see in the string representation of the value that the 
node is positioned at level 4, and that it's the child of 
the parent node with the path */2/1/1/. To convert a 
given canonical string representation of a hierarchical 
value to HIERARCHYID, you can either use the 
static method HIERARCHYID::Parse, or simply use 
the CAST or CONVERT functions. For example, the 
expression CAST(/2/1/1/3/ AS HIERARCHYID) 
returns the HIERARCHYID value 0x6AD6F0. 

To return the current contents of the Employees 
table, along with the logical string representations of 
the hid values, run the following guery: 


SELECT hid, hid.ToString() AS path, lvl, empid, empname, 
salary 

FROM dbo.Employees 

ORDER BY hid; 


This guery returns the output shown in Web Table 1. 


Querying the Hierarchy 

Now that Pve covered how to create and populate the 
hierarchy, I'll explain how to query the hierarchy to 
answer common requests such as sorting the hierarchy, 
and returning a subtree, a path, direct subordinates, 
and so on. 

Remember that earlier I mentioned that the HIER- 
ARCHYID type provides topological sorting. So if you 
want to sort the employees such that a subordinate will be 
returned after a manager, you can simply sort by the hid 
column. You can use the lvl column to produce indenta- 
tion, replicate some string lvl times, and concatenate the 
employee name. The following query demonstrates both 
sorting topologically and producing indentation: 


SELECT empid, REPLICATE(* | ‘, lvl) + empname AS emp 
FROM dbo.Employees 
ORDER BY hid; 


Web Table 2 shows the output of this query, and Web 
Figure | shows the execution plan. Notice in the plan 
that the depth-first index created on the hid column 
is used here efficiently, preventing the need for a sort 
operation. 
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To return a manager and all his or her subordinates 
in all levels, you can use the IsDescendantOf method. 
This method is invoked for a given HIERARCHYID 
value, and accepts a HIERARCHYID value as input. 
This method returns 1 if the input employee is a 
descendant of the given employee, and O if it isn't. As 
an example, the following query returns Ina (employee 
3) and all of her subordinates in all levels: 


SELECT C.empid, C.empname, C.1v1 
FROM dbo.Employees AS P 
JOIN dbo.Employees AS C 
ON P.empid = 3 
AND C.hid.IsDescendantOf(P.hid) = 1; 


The query joins two instances of the Employees table, 
one called P, representing the parent/manager, and 
the other called C, representing children/subordinates. 
The ON clause filters only the row for employee 3 
from the P instance, and using the IsDescendantOf 
method, filters all subordinates of employee 3 from the 
C instance. Web Table 3 shows the output of this query, 
and Web Figure 2 shows the execution plan. 

Notice in the plan that the depth-first index 
created on the hid column is used here efficiently, 
scanning the consecutive range of rows at the leaf 
with Ina (employee 3) and all her subordinates in all 
levels. The range appears in the plan as >= Phid and 
<= Expr1004. If you inspect the Compute Scalar 
operator that calculates Expr1004, you will find that 
it represents P.hid.DescendantLimit(), which is the 
maximum possible value under P.hid. 

You can limit the number of levels of subordinates 
to return below the given manager by filtering the level 
difference between the subordinate and manager. For 
example, the following query returns Ina and all of her 
subordinates up to two levels below her: 


SELECT C.empid, C.empname, C.1v1 
FROM dbo.Employees AS P 
JOIN dbo.Employees AS C 
ON P.empid = 3 
AND C.hid.IsDescendantOf(P.hid) = 1 
WHERE C.1v1 - P.1vl <= 2; 


To return all managers of a given employee in all levels, 
you need to make minor revisions to the query that 
returns a manager and all subordinates in all levels. 
Simply filter only the given employee from the C 
instance (subordinates) instead of the P instance (man- 
agers), and return the attributes from the P instance. 
For example, this query returns Didi (employee 14) and 
all her managers in all levels: 


SELECT P.empid, P.empname, P.1v1 
FROM dbo.Employees AS P 
JOIN dbo.Employees AS C 
ON C.empid = 14 
AND C.hid.IsDescendantOf(P.hid) = 1; 
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HIERARCHYID 


The output of this query (i.e., Didi and her managers) 
is shown in Web Table 4. 

If you want to limit the number of levels to return, 
simply add a filter on the level difference between the 
subordinate and the manager. For example, the fol- 
lowing query returns Didi and two levels of managers 
above Didi: 


SELECT P.empid, P.empname, P.1v1 
FROM dbo.Employees AS P 
JOIN dbo.Employees AS C 
ON C.empid = 14 
AND C.hid.IsDescendantOf(P.hid) = 1 
WHERE C.1v1 - P.1vl <= 2; 


To get direct subordinates of a given manager, you 
will find the GetAncestor method useful. This method 
operates on a HIERARCHYID value (call it v), and 
accepts an integer (call 1t 1) as input. This method 
returns the HIERARCHYID value of the ancestor 
who is one level above v. As an example, the following 
query shows how to return all direct subordinates of 
Eitan (employee 2): 


SELECT C.empid, C.empname 
FROM dbo.Employees AS P 
JOIN dbo.Employees AS C 
ON P.empid = 2 
AND C.hid.GetAncestor(1) = P.hid; 


This query filters only the row for employee 2 from the 
instance P, and returns all employees from the instance 
Cfor whom the manager one level above is employee 2. 
The output for this query is shown in Web Table 5, and 
the execution plan is shown in Web Figure 3. Notice 
in the plan that the breadth-first index created on lvl 
and hid is used here. 

Similarly, if you want to return all subordinates of 
Eitan, two levels below, simply specify 2 as the input to 
the GetAncestor function: 


SELECT C.empid, C.empname 
FROM dbo.Employees AS P 
JOIN dbo.Employees AS C 
ON P.empid = 2 
AND C.hid.GetAncestor(2) = P.hid; 


More To Come 
You can use SQL Server 2008's new HIERARCHYID 
data type to maintain and query hierarchical data. In 
this article I discussed performance aspects of the data 
type and explained how different indexing strategies 
can support different types of requests. Next month 
TIl explain how you can reparent nodes, move com- 
plete subtrees, and convert a traditional parent-child 
representation of a hierarchy to one that uses the new 
HIERARCHYID data type. SQL] 
InstantDoc ID 99036 
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Solving Virtualization's Pitfalls 


Keeping Performance and Reliability on Your Virtual Machines 


irtualization has enormous potential 

to save companies money, time 

and space. However, like any in- 
novation, there are bound to be a few 
snags and snares that prevent you from 
maintaining optimum performance and 
reliability. While virtualization promises 
efficiencies, as a DBA you know that it 
hardly guarantees performance. The 
anticipated evolution to large scale vir- 
tualization can only be made effective by 
understanding and seamlessly integrat- 
ing this technology into the network. 
Success also hinges on the resolution 
of any reliability-threatening problems, 
like fragmentation, that will spring up. 


Fragmentation 101 

Fragmentation is a performance- 
crippling phenomenon. And, if left 
unchecked, it can lead to hangs, bottle- 
necks, freezes, crashes and total system 
failures. 

Disk fragmentation occurs when 
individual files such as mdf and .Idf, are 
not stored in contiguous segments. To 
fit space restrictions, files are broken up 
and scattered around the hard disk. So 
the file system, whether in a virtual ma- 
chine or not, has to generate more I/Os 
to address the excessive file fragments. 


Pinpointing the Problem 

Without a doubt, the more I/Os 
generated (and multiple virtual ma- 
chines only exacerbate this), the more 
throughput you'll need to account for in 
storage subsystems. Even with expansive, 
fully optimized SAN or RAID imple- 
mentations underlying those virtual 
machines, poor disk performance is still 
regularly traced back to fragmentation. 

Virtualization pundits have been 
cautioning for some time about the major 
performance hit a host operating system 
takes from fragmentation. Fragmenta- 
tion's side effects include I/O bottle- 
necks and the inefficient use of resourc- 
es. These are two major pitfalls along 
the path towards virtualization. They 
limit the performance and reliability of 
virtual machines. And server virtualiza- 
tion actually compounds fragmentation. 
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A fragmented virtual 
disk, in the same 
storage system as 
another fragmented 
virtual disk all on 
top of a fragmented 
logical disk, sig- 
nificantly increases 
the amount and 
severity of frag- 
mentation’s symp- 
toms for the entire 
architecture. 

A misguided and 
somewhat  instinc- 
tual response to 
fragmentation is to 
add more and more 
hardware (more hard drives, 
faster controllers) to fight off its effects. 
But this only mitigates the problem, and 
never solves it. 


Solving Fragmentation 

Diskeeper® 2008 with InvisiTasking™ 
technology automatically solves 
fragmentation in real time with ab- 
solutely zero overhead. Management 
and downtime are things of the past. 
The breakthrough of InvisiTasking 
allows for defragmentation to be a com- 
pletely transparent background process. 
Only untapped resources are used, so 
applications are never interrupted. 
Having Diskeeper running on your 
virtual machines eliminates I/O bottle- 
necks and ensures resources are used 
more efficiently. 

Combining virtualization’s economical 
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resource use with Diskeepers trans- 
parent system enhancements leads to 
maximum productivity. Diskeeper will 
purge your system of performance- 
wrecking fragmentation. Your systems, 
virtualized or not, will run with un- 
paralleled performance and superior 
resource efficiency. 

The move to virtualization saves money 
and strategically allocates resources. 
Putting Diskeeper 2008 on your vol- 
umes is vital for a problem-free virtual 
infrastructure. 
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SOL Server 2008 


Data Profiling 


Uncover the hidden treasure inside 
SQL Server 2008 Integration Services 


"ve noticed a distressing lack of coverage of SQL 
i Server 2008 Integration Services’ (SSIS) new 

data profiling feature, even though data profiling 
capabilities are a significant feature for data warehouse 
developers. Data profiling tools have been available for 
some time, but until now, SQL Server hasn’t natively 
had the capability. The relative lack of coverage of SQL 
Server 2008's data profiling capability truly makes it a 
hidden treasure. 

Data profiling is the process of examining source 
system data and collecting various statistics for data 
quality and data integration assessment. It has been 
a frequently requested item for SSIS—or at least the 
larger SQL Server business intelligence (BI) platform— 
for some time now. Microsoft is competing heavily in 
the BI industry against other BI platform providers 
(many of which have now consolidated) that have 
offered data-profiling capabilities for some time. This 
article shows how to set up the Data Profiling task, 
designate profiles, and build and view profile output in 
the new Data Profile Viewer utility. 


Available Profiles 

The SSIS Data Profiling task contains a total of 
eight data profiles that you can use to assess the data 
quality of source systems. Five of the profiles analyze 
individual columns in a table or view. The remaining 
three profiles analyze multiple columns or relationships 
between columns, tables, or views. 

At this time, it isn’t known if we'll be able to create 
custom data profiles in the release-to-manufacturing 
(RTM) build of SQL Server 2008. 

Some of the available profiles can only profile a 
particular data type. For example, if you designate a 
column-length profile on an integer-based column, 
you'll receive a design-time warning on the Data 
Profiling task. If you attempt to execute either an indi- 
vidual Data Profiling task or a package that involves 
an incorrectly configured Data Profiling task, you'll 
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get a dialog box informing you of the problem and the 
execution process will halt. 

Column-based profiles. The Column Length Distri- 
bution profile informs you of the various lengths for 
the values in a string-based column and the percentage 
(relative to the other distinct string lengths) those 
lengths represent. The Column Length Distribution 
profile supports columns with character data types. 
You can use the Column Pattern profile to determine 
the percentage of values in a column that match a 
particular profile. The Column Pattern profile supports 
character data types. If you need to review a column's 
NULL count, you can leverage the Column Null Ratio 
profile, which provides the count and percentage of 
NULL values in a given column. Because the Column 
Null Ratio profile is looking only for NULLS, it can 
analyze a column with any data type. 

Next up is the Column Statistics profile, which 
outputs a few basic statistics for a given column based 
on the data type. For numeric data types, the profile 
returns minimum, maximum, mean, and standard 
deviation values. For datetime data types, the profile 
returns the minimum and maximum values; the new 
Date and Time data types aren't supported. The 
Column Value Distribution profile informs you of the 
distinct values in a given column and the percentages 
those unique values represent. The Column Value 
Distribution profile supports columns with numeric, 
character, and datetime data types. 

Multicolumn, multitable profiles. There are three 
profile types in this category. The Candidate Key pro- 
file tells you a given column's or combined columns" 
key strength, which is a column's (or group of col- 
umns’) ability to uniquely identify individual records 
within a given table. The Functional Dependency pro- 
file tells you the extent to which values in one column 
(i.e., dependent column) depend on another column's 
(i.e., determinant column’s) values. For example, ZIP 
code is highly dependent on state. Finally, there is 
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the Value Inclusion profile, which informs you of the 
potential for a column’s or combined columns’ ability 
to serve as a foreign key that could enforce a data 
integrity relationship between two tables. Value Inclu- 
sion supports columns with the integer, character, and 
datetime data types. 


The Data Profiling Task 
on the designer surface 
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1 E Configure the properties used to profile data sources. 


General View All Requests ES 


the toolbox you will notice a new item called 
Data Profiling Task. Drag and drop an 
instance of the Data Profiling Task from the 
Microsoft Visual Studio toolbox onto the 
package’s Control Flow designer, as shown 
in Figure 1. 

Note that SOL Server 2008 Data Pro- 
filing can analyze only SOL Server 2000 and 
later systems, and you must use the ADO 
„NET Connection Manager. 

Before you configure the Data Profiling 
| task, you'll need to create an ADO.NET 
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|| 4 Correction Managers 
mm vil iais ika connection manager to the source system to 
| be profiled. With the Data Profiling task on 


the designer surface, right-click the task and 
select Edit. In the editor, you can optionally 
designate a timeout setting to dictate the 
length in seconds the task can run before timing out. 
Next are the destination settings—this is where usage 
comes into play. There are two primary uses of SQL 
Server Data Profiling: 
* performing ad-hoc formal source system 
assessments 
e performing recurring data quality checks 
as part of a larger extraction, transforma- 
tion, and loading (ETL) process 


If you're performing ad-hoc assessments, 
simply designate a file destination for your 
profile’s output. However, if you want to 


Profile Requests 
Expressions Profile Type Request ID | 


Candidate Key Profile Request 

Column Length Distribution Profile Request 
Column Null Ratio Profile Request 

Column Pattern Profile Request 

Column Statistics Profile Request 

Column Value Distribution Profile Request 
Functional Dependency Profile Request 


perform recurring data profiling as part of an 
ETL (or other) process, you'll want to load 
the output of your profiles into SSIS vari- 
ables. If you specify a variable destination, 
simply select or create a new SSIS variable. 
If you select a file destination, you'll create 
a new File Connection Manager and— 


Value Inclusion Profile Reguest 
R Properties: 


optionally—specify whether to overwrite an 


existing file. 


Now, to designate profiles, you can either 


select Profile Requests in the left pane of the 


Figure 2 


Configuring the 

Data Profiling Task 

As with other SSIS tasks, the Data Profiling task is reg- 
istered and installed during SQL Server 2008 setup. The 
task’s DLL is located at C:\Program Files\Microsoft 
SQL Server\l00\DTS\Tasks. The Data Profiling task 
is a SSIS feature; consequently, to use it you must first 
start up Business Intelligence Development Studio 
(BIDS). Once BIDS has been launched you will need 
to either create a new or open an existing SSIS project. 


Selecting a profile type 
in the Profile Requests 
section of the editor 
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editor or click the Ouick Profile button in 
the General section of the editor. The Ouick 
Profile route lets you select multiple profiles 
guickly that have default settings automati- 
cally applied. 

As Figure 2 shows, the Profile Reguests section has 
a Profile Type drop-down list from which you can select 
one of the eight available profiles. You can optionally 
provide a non-default Reguest ID for the profile and 
designate the profile's settings in the Reguest Properties 
section. Note that you can also filter the selected profiles 
by expanding the View drop-down menu at the top of 
the dialog box. If you select a profile type in the View 
drop-down list, the Reguest Properties section displays 
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the selected instances of the chosen profile type. After 
you designate the profiles to be used and their settings, 
click OK to complete the data profiling configuration 
process. Notice also that the editor's treeview pane has a 
section called Expressions. As with the other SSIS tasks, 
you can optionally assign values for the Data Profiling 
task properties at runtime. For example, if you wanted 
to supply a single Data Profiling task with different con- 
figurations at runtime, you could apply an expression to 
the task’s ProfileInputXML expression. 


Building and Viewing 

Profile Output 

To build the profiler's output, simply execute the Data 
Profiling task. If you designated a file destination, you 
can view the profiler's output by using a tool called 
Data Profile Viewer (DPV) that ships with SQL Server 
2008 Community Technology Preview 5 (CTP5). For 
now, DPV is a standalone executable that analyzes the 
output of the Data Profiling task. BIDS will probably 
include support for launching the tool in the final 
release. DPV doesn't automatically appear on the Win- 
dows Start menu as of CTP5, so I recommend creating 
a shortcut to the executable. The file is located at CA 
Program Files\Microsoft SQL Server\100\DTS\Binn 
and is called DataProfileViewer.exe. 

After you launch DPV, you can open a profiler 
output .xml file by clicking Open, the only command on 
the menu bar, and navigating to the desired file. This file's 
content will have been populated from a prior execution 
of a Data Profiling task. Figure 3 depicts DPV with 
output from a Column Value Distribution profile. 

As you can see in Figure 3, the DPV utility has two 
main sections. You use the left pane to choose which 
profile to review. The executed profiles are arranged by 
the type of database objects they were applied against. 
The right side of the DPV window shows you the details 
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of the selected profile. The output of each of the various 
profiles is unique to that profile, so you'll see different 
output in the details pane for each profile you select. 


The DataProfile.xsd Schema 

The Data Profiling task output adheres to the Data- 
Profile.xsd schema, which you can see at schemas 
.microsoft.com/sqlserver/2008/DataDebugger/Data 


Profile.xsd. The schema has three core sections: data 


sources, profile input, and profile output. The data 
sources section contains all the data sources used by the 
Data Profiling task. The profile input section is where 
each selected profile is stored, along with the tables, 
views, and columns it's applied to. The profile output 
section contains the results of each profile selected. 


A Data Profiling Example 

You now have a solid foundation in SQL Server 
2008 data profiling. Let's run through a sample of 
the feature in a test environment. Suppose that the 
AdventureWorks Company has charged you with 
building a new formal data warehouse system to sup- 
port organizational decision making. As one of your 
first steps, you'll need to assess the data quality of the 
Adventure Works OLTP database. In this example, 
you'll assess the data quality of the Production.Product 
table. Note that the Adventure Works sample databases 
are no longer included as part of the SQL Server setup; 
the SQL Server sample databases have been relocated 
to the Microsoft CodePlex portal (www.codeplex.com/ 
MSFTDBProdSamples). 

First, create a new SSIS package and drag an 
instance of the Data Profiling task onto the Control 
Flow designer. Next, create the ADO.NET connection 
manager to a local AdventureWorks OLTP database. 
Now configure the Data Profiling task. Because you're 
performing an ad-hoc analysis, send the output to a 
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usiness intelligence (BI) 
technology includes a wide 
variety of applications 

used to support decision- 
making in an organization, 
whether through reports 
summarizing daily 
operations, exploratory tools 
to analyze the root cause of a business problem, 

or predictive processes that forecast a future state 
based on historical results. Typically, to provide these 
capabilities to users, you a start by building a data 
warehouse using an Extract, Transform, and Load 
(ETL) tool to consolidate information from disparate 
sources, cleanse and transform the raw data into 
information suitable for business queries, and load 
the data into a schema optimized for querying. In 
addition to a relational data warehouse, you might 
add an online analytical processing (OLAP) tool to 
your technology infrastructure to support more 
complex, high-performance queries or add a data 
mining tool to reveal interesting patterns in the data. 
A presentation layer is then added to deliver the 
resulting information from your Bl platform to users. 
In this Essential Guide, we look at how the tools 
available in Microsoft SQL Server 2005 support these 
capabilities: 

e Integration Services 

e Analysis Services 

e Reporting Services 


Integration Services 

Integration Services in its simplest implementation 
moves data between data stores or automates 
database maintenance tasks, but in Bl scenarios it 
supports high-performance ETL. Extract operations 
involve the retrieval of data from one or more data 
sources. Transform operations include the cleansing of 
data, merging data from multiple sources, and making 
modifications to data as necessary to prepare it for 
use in analysis. Load operations insert the transformed 
data into data warehouse tables. 


The Integration Services architecture includes a 
Windows service, a run-time engine, and a data 
flow engine. The Windows service can start, stop, 
and monitor a package, which is the container 
for the instructions to perform one or more ETL 
Operations. You can also use the service to manage 
package storage on the file system or in the msdb 
database of the SQL Server. The service must be 
running to monitor executing packages or to 
import and export packages from storage, but it 
is not required to design or execute packages. The 


runtime engine is the component that executes 
packages and also manages rules applied to package 
execution, such as configuration information, and 
activities related to package execution, such as 
logging, transactions, and breakpoints. The data 
flow engine is an in-memory buffer architecture that 
manages the process of getting data from sources, 
applies changes to that data while in memory using 
transformation components, and then delivers the 
results to destinations. The key to speed and efficiency 
in Integration Services is the sequential operation 

of transformations on data in the buffer, which 
eliminates the overhead of read/write disk operations 
while data is cleansed and prepared for the data 
warehouse environment. 


To use Integration Services for ETL operations, you 
create a package and add control flow components: 
tasks, containers, and event handlers. Tasks are the 
most basic unit of work, and include such operations 
as executing SQL statements, downloading or 
uploading files on an FTP server, or running custom 
code. Containers organize tasks to be executed 
together and, in the case of the For Loop or ForEach 
Loop containers, define the conditions under which 
the tasks will execute. Precedence constraints connect 
tasks and containers in a workflow that determines 
the sequence and conditions for execution. For error 
handling, you can use a precedence constraint to 
start a workflow following a failed task, or you can 
use an event handler to define a sequence of tasks to 
execute for a specific task event, such as OnWarning 
or OnError. 


A typical ETL system in Integration Services includes 
one package per dimension or fact table in the 

data warehouse. The control flow of each package 
has at minimum a Data Flow Task for the core ETL 
processing, but can also include other tasks, such as 
an Execute SQL Task to update auditing tables with 
row counts and other batch-related information. You 
can use a series of Execute Package tasks in a parent 
package to sequence all dimension processing before 
fact processing. 


The Data Flow Task is the key component for ETL 

in Integration Services. It has data source and data 
destination adapters that support extract and load 
operations using the following data connection 
types: OLE DB, ADO.NET, ODBC, TXT, CSV, and Excel. 
In addition, adapters can read from XML files and 
write to in-memory objects for use by Reporting 
Services or ADODB. This task also includes a variety 
of transformation components to support common 


operations for dimension and fact processing, such as 
conversions of null values to a default value, trimming of 
string data, data type conversions, calculations, and de- 
duplication. 


For dimension processing, you can use the Slowly 
Changing Dimension transformation to configure 
operations to support both Type 1 and Type 2 changes 
within the same dimension. With a Type 1 change, 

a changed column is updated in place and historical 
information is lost. A Type 2 change preserves historical 
information by flagging the former dimension record as 
expired, and inserting a new dimension record flagged as 
current. 


Fact processing involves rolling up the transactional data 
to a common grain, such as item detail by day or month, 
and finding surrogate keys for dimensions. To perform 
transactional roll-up, the Aggregate transform is used to 
define the groupings and calculate the sum, minimum, 
maximum, average, or count for the groups. The Lookup 
transform component is used to match each dimension's 
natural key in the fact source to the surrogate key in the 
dimension table. 


Reusability and flexibility in package design is possible by 
using package configurations. This capability allows you to 
override properties of executables with replacement values 
stored in an XML file, a database table, an environment 
variable, a registry setting, or a variable in a parent 
package. For example, when you use the same package 
on development, test, and production servers, you can use 
a package configuration containing a connection string to 
use the applicable data source at runtime. 


Analysis Services 

Analysis Services provides two key analytical capabilities for 
your Bl applications: online analytical processing (OLAP) 
and data mining. First, the OLAP engine allows users to 
perform ad hoc queries and get answers to questions 
more easily and much faster than possible with relational 
data sources. Second, the data mining engine provides 
another way to explore data to find meaningful patterns 
and relationships in data, and it can also be used to predict 
future values based on historical patterns. 


Analysis Services runs as a service independently of SQL 
Server. Data extracted from a data warehouse is stored 
in an OLAP database as a multidimensional object called 
a cube. High-performance queries are made possible by 
the highly indexed physical structure of the cube and by 
cube aggregations, which are pre-calculated summaries 
of various combinations of data. The storage engine 
decides at query time whether a query can be answered 


by aggregations, which is the fastest method, or by the 
lowest level of detail stored in the cube, which is the 
slowest method, or by some combination of these two 
methods. Additionally, with Enterprise Edition, you can 
partition data in the cube into subsets that can be queried 
independently and in parallel, which further optimizes the 
query processing. 


Having a data warehouse in place before you build an 
Analysis Services database is not required, but it simplifies 
the development process. Your Analysis Services database 
includes a data source to define the connection to the 
data warehouse, and a data source view to provide an 
abstraction layer to the source data. You can make logical 
changes to the table structures in the data source view 
without affecting the underlying data source, and thereby 
control how data is loaded into the Analysis Services 
database. 


Dimensions are added to the database to define the 
context of analysis, such as time or location. You 

can optionally add dimension hierarchies to support 
queries that start at a summarized level and progress to 
successively more granular levels of detail, such as year, 
quarter, and month. A cube combines dimensions with 
the measures to be analyzed and defines the relationships 
between measures and dimensions. For example, you can 
link sales measures to days in a time dimension, but link 
budget measures to months in the same dimension. You 
can then compare sales and budget in the same cube at 
the month level or higher. 


The cube data can be enhanced by the addition of 
formulas for calculations and key performance indicators. 
Calculations are values that cannot be stored in the 

fact table, such as ratios like gross margin percent or 
cumulative values like year-to-date sales. Key performance 
indicators are used to compare a current value to a target 
value and to assign a score and a trend direction for 
monitoring business performance over time. 


An Analysis Services database is also the container for data 
mining models, which can be based either on data from 

a relational table or from a cube. You use data mining 

to explore data to better understand your business, such 
as defining customer demographics for more effective 
target marketing. You can also use it to predict values, 
such as forecasted sales or to predict the next step in a 
sequence. Several data mining algorithms are available 

in Analysis Services: Association, Clustering, Sequence 
Clustering, Decision Trees, Naive Bayes, Linear Regression, 
Logistic Regression, Neural Network, and Time Series. 
With Enterprise Edition, you can incorporate a data mining 
model into your ETL processes. 


Reporting Services 

Reporting Services is a reporting platform that supports the 
full life-cycle of reports—from development to centralized 
management to delivery—in a variety of formats. You 

can develop enterprise reports to combine data from one 
or more data sources and use precise layouts for online 
access or for distribution in print. You can also allow users 
to construct simpler ad hoc reports with controlled access 
to a single data source and with more limited layout and 
formatting capabilities. In either scenario, you can provide 
access to data in your online transactional processing 
systems, your data warehouse, or your OLAP systems. 


The report server runs as a Web service on an Internet 
Information Services (IIS) server and responds to client 
requests by activating components required to satisfy 

the request. These components handle data processing, 
security management, report rending, scheduling, and 
subscription delivery. A Web application is used by 
administrators to manage the server and by users to access 
reports online. All data required by the Reporting Services 
application, including reports, authorized users, data 
sources, subscription definitions, and more, are stored in a 
SQL Server 2005 database either on the same server as the 
report server or on a remote database server. A dedicated 
temporary database is also used by Reporting Services to 
manage report caching for improved performance. 


To develop enterprise reports, you define data sources and 
queries for the report, arrange and group the data, and 

set formatting properties. Supported data sources include 
SQL Server 7.0 or later, SQL Server 2005 Analysis Services, 
OLE DB, Oracle, ODBC, DB2, Hyperion Essbase, SAP, SAP BI 
Netweaver, and XML. You also configure properties of report 
items to control behavior, such as conditional formatting, 
conditional visibility, interactive sorting, and navigation to 
another location. As you progress through the development 
of the report, you can preview the report to test its 
appearance and behavior before putting it into production. 


When you are satisfied with the results, you add the 

report definition to the report server database. This report 
definition is an XML file that describes the queries, the 
layout, and report item properties. When a user executes a 
report on demand, the report’s queries are executed and the 
report is rendered with the data according to this definition. 


Execution on demand is the default setting for reports, but 
administrators can change execution properties to cache 
reports for quicker access to frequently viewed reports, 

or to generate snapshots for fastest performance and to 
maintain a history of report contents. Administrators can 
also apply security to individual reports or to a group of 
reports contained within a folder. Security configuration is 


role-based and determines which tasks a user or group of 
users can perform, from view only to full control. 


Regardless of the execution method, as long as a user has 
access to a report online, the user can save a local copy 
of the report by exporting it as a CSV, Excel, MHTML, PDF, 
TIFF, or XML file. Alternatively, a user can subscribe to a 
report for delivery on a scheduled basis to an email inbox 
or to a network file share in any of the available formats. 
In Enterprise Edition, administrators can configure data- 
driven subscriptions to deliver reports to a dynamic list of 
recipients. 


Before users can build ad hoc reports, you must create a 
report model, which is a logical representation of tables or 
views in a single data source that insulates the users from 
the complexities of a database query. Valid data sources for 
a report model include SQL Server, Analysis Services, and 
Oracle. You can enhance the model by defining formatting 
properties, adding calculations and pre-defined filters, and 
configuring properties that determine how report model 
objects display in a report.as individual rows or in aggregate. 


After you deploy a report model to the report server, a 
user launches Report Builder, a ClickOnce application, to 
create a report by selecting items from the report model 
and arranging these items as a table, matrix, or chart. The 
user can add report parameters and apply filters, sorts, 
and grouping. When the user executes the report, a query 
is generated based on the arrangement of items in the 
report. The report can optionally be saved to the report 
server if the user has permissions to publish content. 

In Enterprise Edition, the user can use the clickthrough 
feature to navigate from a summary report to the details 
for a selected value as defined by the report model. 


SQL Server 2005 

Business Intelligence 

The BI tools in SQL Server 2005 provide a wide range 

of functionality to support better access to information 

in your organization. You can use these technologies 
independently or in combination to deliver the right 
information in the right format to your users. Furthermore, 
you can build custom applications or integrate third-party 
tools to extend the built-in capabilities of these tools to 
meet specific information requirements. 


Stacia Misner is a Bl consultant, educator, author, and principal 
of Data Inspirations, a consulting firm based in Las Vegas, NV. She 
is author of Microsoft SQL Server 2005 Reporting Services Step by 
Step and co-author of Microsoft SQL Server 2005 Analysis Services 
Step by Step and four other titles published by Microsoft Press. 
Stacia specializes in providing consulting services and classroom 
instruction for Microsoft's full stack of Bl products. 
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SQL SERVER 2008 DATA PROFILING 


local XML file called AW_ProductTableAnalysis.xml 
on the desktop. 

Now you need to select individual profiles to be run 
against the Production.Product table. Use the quick 
profile option (you can assume the default settings for 
these samples). I don't explicitly walk you though the 
Quick Profile dialog box, which Figure 4 shows, but it's 
quite simple. 

The Quick Profile dialog box configures the profiles 
to use all columns for the selected table or view. To 
review, click OK in the Quick Profile dialog box, then 
select Profile Requests in the left pane of the tasks 
editor. You can browse through the various profiles 
and review the default settings the Quick Profile pro- 
cess assigns. Click OK to complete the Data Profiling 
task configuration. You can't select a Value Inclusion 
profile with the Quick Profile option; you need to use 
the Profile Requests section of the editor. 

Now execute the package by pressing F5. The 
Data Profiling task turns green and a file containing 
the output from the profile operation is created on the 
desktop. Finally, use DPV to review the output of the 
Data Profiling task. First, launch the DPV, then click 
Open and select the new AW_ProductTableAnalysis 
xml file. Figure 5 shows the task’s output in the 
viewer without any individual profile selected. Notice 
that because you only ran column-level profiles, you 
must navigate to an individual column before you 
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or not; however, the SQL Server BI stack continues to 
incorporate more and more concepts from Kimball's 
formalized data warehouse lifecycle, and data profiling 
is a good example. Data profiling is a formal part of 
Kimball's data warehousing methodology, and beyond 
Kimball, it's a fairly common task for a data warehouse 
developer. 

Data profiling was a much overdue feature in the 
SQL Server BI stack. For a first build, the feature is 
fairly impressive. Probably the biggest complaint will 
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see any of the profiles you ran in the results pane. 
You can now see the length, values, NULL ratio, and 
statistics for the various columns in the Production. 
Product table! 


A Long-Overdue Feature 


I don't know if Microsoft and the Kimball Group (www 
„kimballgroup.com) have some form of partnership 
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be that it can currently run only on SQL Server 2000 
or later, but I expect this situation to change over time. 
In addition, I expect to see the ability to build your 
own custom profiles and plug them into the larger 
data-profiling framework. SQL Server 2008 has sev- 
eral key BI enhancements, and the SSIS data-profiling 
capability is definitely one such feature. SOL) 
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Figure 4 


The Ouick Profile 
dialog box 


Figure 5 
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Generate a 


Clickthrough Report 
Using 
Visual Studio ReportViewer 


local reports without SSRS 


ased on the number of questions Pve seen 
B about the Visual Studio ReportViewer con- 

trol, it appears that many administrators 
don’t know that this control can perform sophisticated 
operations on its own. You might have seen examples 
of SQL Server Reporting Services (SSRS) clickthrough 
reports, but did you know that this same functionality 
is available when generating a local report with the 
ReportViewer? 

A local report doesn't require an SSRS server 
because the application provides the tools to distribute 
the report. So, even without benefit of SSRS, you can 
give the user the ability to execute parameter-driven 
reports of all kinds—even drillthrough reports. Report- 
Viewer also lets developers manage input parameters 
and integrate reports with other UI elements on the 
form or ASP page. While it might seem challenging 
to implement a report with the ReportViewer control 
because you're responsible for much of the leg work, in 
the end you'll find this an essential tool when creating 
forms over data applications. This control gives devel- 
opers virtually infinite flexibility in how the Report 
Definition Language (RDL) is presented. For a com- 
prehensive explanation of how to create and manage 
reports with ReportViewer, see the Learning Path at 
www.sqlmag.com, InstantDoc ID 99192, 

Although you can use either Visual Studio (VS) 
2008 or VS 2005 to set up a clickthrough report without 
using SSRS, after SQL Server 2008 ships, VS 2008 will 
include SSRS hooks, which will make VS 2008 more 
useful for this project. (Note that in VS 2008 Report- 
Viewer has been renamed Microsoft ReportViewer.) 


Creating the 

Project and Datasource 

To create a project and datasource, you first need to 
build an example application. Start by creating a new 
Windows Forms project in VS. I created a strongly typed 
datasource that points to the SQL Server Adventure- 
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Learn how to create 


Works example database. I chose not to select any 
table, view, or stored procedure sources for the data 
set because I want to illustrate how to use the Table- 

Adapter Designer to create a new TableAdapter from 

scratch. The following steps take you through the code 

generation process for the first TableAdapter. 

1. In the new Windows Forms project, click the Data 
menu and select Add New Data Source. 

2. For the Data Source Type, select Database. 

3. Confirm the Data Connection or point to the 
AdventureWorks sample database. (Adventure- 
Works is available as either an optional installa- 
tion choice or a download at www.codeplex.com/ 
MSFTDBProdSamples/Release/ProjectReleases 
„aspx?Releaseld=4004.) 

4. When prompted for selected Database Objects, 
click Finish without selecting anything —no Tables, 
Views, Stored Procedures, or Functions. 

5. Confirm this (lack of) selection. Doing so creates 
an empty TableAdapter .XSD DataSet. 

6. Right-click the AdventureWorksDataSet.xsd file in 
the Solution Explorer, and select Open. 

7. Click Toolbox to open the DataSet toolbox 
window in the VS UI. Now you're ready to auto- 
generate the first of two TableAdapter classes to be 
used as report row sources. 

8. Click and drag a TableAdapter to the XSD 
designer pane. Doing so relaunches the Table- 
Adapter Configuration Wizard but exposes a dif- 
ferent set of options. Then click Next. 

9. Select Use SOL statements, and click Next. You'll 
see a dialog box in which you can enter a SQL 
query to return a rowset to populate the 
base report (in this case). 

10. Type a SQL statement that returns col- 

umns required by the report and calls for 
a single input parameter, as shown in Web Listing 
1 (www.sąlmag.com, InstantDoc ID 99192). 
Note that the SOL statement could be a stored 
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| =string.format("Stock on Hand for. (0}", First(Fields!Name Value) 


Product 


Location Shelf Bin Quantity 


Name 


*Fields!ProductiD Value »Fields!Locatioi «Fields «Fields *FieldsiC »Fielda!Name Value 


Total =SumtFx 


Figure | 
The Stock on Hand 


report layout 


procedure, but entering it yourself makes it easier 
to understand the process. 

11. Click Advanced Options and clear Generate 
Insert, Update, and Delete statements to disable 
the generation of the code for these statements. 
You're creating a report, not a data management 
front end. 


Creating the 

RDLC Reports 

This example uses Report Definition Language Client- 
Side (RDLC) reports; one to show products by style 
and another to display details on a selected product. 
The first report displays pertinent information from the 
Products table based on the Product TableAdapter. The 
second report displays information from selected rows 
based on the ProductInventory TableAdapter. Although 
I won't go into great detail about how to build a report, 
here is a brief summary of how to do so: Add a new 
Report item to the Project, and then add a Table Report 
item. Drag appropriate fields from the Data Source’s 
DataSet to the Table item. Note that these 
operations are recorded in the RDLC and 
that the DataSet name is an integral part 


=Fields!Name Value 


Figure 2 


Stock report layout 


Coat 


=FieldsProductHumb 


Right-click here to set click- 
through link 


12. Click OK and then Next. 

13. In the Choose Methods to Generate dialog box, 
clear the check box to disable code generation for 
the Return a DataTable and GenerateDBDirect- 
Methods. (You need only a Fill at this point.) 

14. Click Next, and Finish. 


Now that VS has completed the code genera- 
tion for the first TableAdapter, repeat the process to 
create a second TableAdapter to return rows for the 
drillthrough query. The code generation process for the 
second TableAdapter 
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is similar to the pro- 
cess for the first Table- 
Adapter, with a few 
changes. Repeat steps 
8-14, but this time, 
use the Web Listing 
2 SQL statement in 
step 11 to return the 
rowset. This param- 
eter query uses the 
ProductID passed 
from the report pro- 
cessor as the input 


Figure 3 


Setting the click- 
through report name 
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parameter. 

After you've cre- 
ated the new TableAdapter, right-click the diagram 
and rename it StockByProduct. Now you have the 
two related row sources that you can use to generate a 
clickthrough report. 


Standard List Price 


elds =Fields!ListP) =Fields!Stanc =Fields!ListPy 


of this process. (Note that when you're 
working with report DataSets, you must use 
the DataSet name to help the report pro- 
cessor know where to apply the data you're 
returning.) To complete the report, set some 
colors and a few formatting properties. For 
more information about building reports, 
see the Learning Path at www.sqlmag 
.com, InstantDoc ID 99192. Figure 1 shows 
how the report appears in the designer when com- 
pleted. Note that if you were to use existing report 
definition files, you'd need to reconnect them to the 
correct Data Source elements. The second report (as 
shown in Figure 1) includes data from columns from 
the StockByProduct TableAdapter rowset. 


Linking to the 

Clickthrough Report 

Now you're ready to link the two reports together. The 

idea is to display the Stock on Hand report (shown in 

Figure 1) when the user clicks a Product Number item 

in the Stock report (shown in Figure 2) by passing the 

currently selected Product number to the Product- 

Inventory FillByProductNumber query as a param- 

eter. Linking the two reports is accomplished by editing 

the RDLC report definition files using the Report 

Designer built into VS. The following steps take you 

through the process of linking two reports together: 

1. In the Stock report, right-click the Product Number 
cell to open the Textbox Properties pane. (Note 
that because the Textbox report cell is a property, it 
can be set with an expression. This means you can 
trigger any RDLC report you want to—as long as 
the report processor can find it at runtime.) 

2. Click the Navigation tab, and set Jump to report to 
the name of the drillthrough RDLC file, as shown 
in Figure 3. 

3. Now, set up the parameter to be passed to the 
linked report. Click the Parameters button, which 
is shown in Figure 3. 
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4. In the Parameters dialog box, shown in Figure 4, 
enter ProductIDWanted as the Parameter Name 
and choose the expression for ProductID from the 
drop-down list. Click OK twice to return to the 
RDLC designer. 


Setting Up the 

ReportViewer Control 

To have VS generate most of the UI code, you can 

leverage the TableAdapter and VS's ability to set up 

the UI elements to capture the input parameter, Style- 

Wanted, for the initial guery. StyleWanted should be 

sufficient because you won't need any UI elements 

to browse through the rowset or make changes. The 
following steps take you through setting up the Report- 

Viewer Control: 

1. Open the Form! design-mode window. 

2. Drag the Product data set from the Data Sources 
window to the form design surface. Doing so adds 
the ProductTableAdapter to the form's generated 
code. It also adds a label and textbox to capture 
the parameter, and a Fill button to launch the Fill 
method defined for the TableAdapter. In addition, 
it supplies a DataGridView control, and Table- 
AdapterManager, BindingSource, and Binding- 
Navigator classes that you don't need. 

3. Delete these unneeded classes; they're used to 
provide update and scrolling mechanisms that you 
don't need. All you want is the tool strip to manage 
the parameter and expose the Fill button. The 
result should look like Figure 5. 

4. Now drag the ReportViewer control to the form. 
Immediately, the control prompts for the name 
of the RDLC report file or the path to an SSRS 
server-hosted report. It will look like Figure 6, 
page 30, Note that the control has been renamed 
Microsoft ReportViewer in VS 2008, and the con- 
trol has been moved to the Reporting section of 
the toolbox. 


When you add the ReportViewer control to the 
form or rebind the Data Source, VS drops a line or two 
of code into the Form_Load event handler because 
it doesn't know where else to put it. However, you 
shouldn't leave the code there. Instead, move the Fill 
method and RefreshReport method calls to the Tool- 
Strip Button_Click event handler so they're executed 
after you've captured the desired Style parameter from 
the user. Incidentally, the generated code won't work 
because it ignores the fact that the Fill method requires 
an input parameter. See Web Listing 3 for the code. 

When you enter a style value (such as “M”) in the 
ToolStrip text box and click the ToolStrip Fill button, 
this code executes the parameterized Fill method, binds 
the rowset to the Report DataSet, and shows the report 
hosted in the ReportViewer control. 
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Configuring the 

Clickthrough Report 

At this point, the initial Products By Profit report 

should work. Now you need to have ReportViewer 

call the routine that invokes the StockReport FillBy- 

ProductID method and bind to the generated Data- 

Table. The following steps show you how to configure 

the clickthrough report: 

1. First, make sure your form's Class includes an 
Imports statement to help the compiler resolve the 


Figure 4 


Designating the 
parameter to pass to 


the clickthrough report 


StyleWanted: FillByStyle 


m 


a] Adventure WorksDataSet {1 ProductTableAdapter 


15% FillByStyleToo! 


Microsoft.Reporting. WinForms namespace. The 
Imports statement permits VS to see the Report- 
Viewer DrillthroughEventArgs: Imports Microsoft 
-Reporting. WinForms 

2. Next, set up an input parameter for the Stock- 
Report using the VS Report Designer. As a 
reminder, click the StockReport in the designer 
and select Report, Report Parameters. 

3. Add a new parameter. Figure 7, page 30, shows 
that the ProductIDWanted parameter is defined. 

4. Mark the parameter as hidden because it’s pro- 
vided only in code. 


Coding the Drillthrough 
Event Handler 
The next step is to respond to the Drillthrough event 
that’s fired when a user clicks the report cell (text box) 
that’s programmed to navigate to the Stock Report. 
This routine (shown in Web Listing 4) ensures that it’s 
the Stock Report that’s been requested. 

Now you need to execute the parameter-driven Fill 
method and point the StockReport Data Source to the 
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Figure 6 
DataTable that’s returned. The code in Web Listing 5 
handles this task. 

The comments in this routine make it fairly self- 
explanatory. The base ProductProfitByStyle report 


Pointing to the 
ProductProfitByStyle 
report definition file 
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Figure 7 


Defining the report parameters for the Stock Report 
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software 


should now run (after a style parameter is provided). 
When the user clicks the Product Number text box cell, 
the Click Through expression is programmed to run the 
stock guery and display a report with the new data. To 
get back to the first report, the user needs to click the 
Back button (green arrow) on the ReportViewer. 


Infinite Flexibility with 
ReportViewer 
Congratulations. You now know how to set up a 
clickthrough report without using SSRS. In SSRS, 
there's no need for extra code to generate the data 
sources because the code is written at design time 
and re-executed at runtime. However, using the VS 
ReportViewer control without SSRS gives you infinite 
flexibility because you can call whatever methods 
you desire to return the rowset consumed by the 
clickthrough report. Just remember that the Report- 
DataSource must match the named RDLC DataSet, 
and the named columns defined in the RDLC must 
be exposed in the DataTable passed to the report 
SQL] 
InstantDoc ID 99192_ 


processor. 


ApexSQL  www.apexsql.com 


or phone 866-665- 


SQL Server Magazine * www.sqlmag.com 


Rebuild Only the Indexes 
that Need Help 


Avoid the performance impact of 


rebuilding willy-nilly 


common misconception among SOL Server 

DBAs is that they need to rebuild their indexes 
often—even as often as every night. The theory is that, 
to be effective, indexes can't have any fragmentation. 
Another common misconception about indexes is that 
if you don't rebuild them frequently, they'll become 
corrupted, and then you're in a HEAP of trouble. 
(Apologies for the pun.) 

For several reasons, 1 don't want to get into great 
detail about what fragmentation is or how it happens. 
First, I just don't have enough space to do the topic 
justice, and second, a lot of information about this sub- 
ject is readily available. (For several excellent articles by 
Kalen Delaney, check out the Learning Path.) Instead, 
I want to focus on a quick solution by highlighting 
a stored procedure—rebuild_indexes_by_db—that 
tackles only the indexes that actually need help. 


Who’s Corrupting You? 

Corruption in databases, and especially in indexes, 
was fairly common back before SQL Server 7.0, often 
necessitating index rebuilds. Changes in the internal 
structures of indexes and how the system handles 
them within the storage engine have essentially elimi- 
nated index corruption. Today, most index corruption 
results from faulty hardware or drivers—not from 
the database engine. Regularly running the DBCC 
CHECKDB database-integrity task is the best way 
to catch index corruption before it goes too far. It’s 
true that rebuilding a non-clustered index can fix the 
corruption, but it’s important to address the source. 
Simply rebuilding each night is probably masking the 
true problem. 


Fragmentation Fallacies 

Keep in mind that the primary theme of these articles is 
performance. Can fragmentation affect the performance 
of your queries? Under certain circumstances, of course 
it can! Although the white paper “Microsoft SQL Server 
2000 Index Defragmentation Best Practices” (www 
.microsoft.com/technet/prodtechnol/sq1/2000/ 


maintain/ss2kidbp.mspx) was written for SQL Server 


2000, the same principles apply to SQL Server 2005. 
If you're performing typical online transaction pro- 
cessing (OLTP) operations—Inserts, Updates, Deletes, 
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Selects—on small numbers of rows, fragmentation 
doesn't harm you nearly as much as you might think. 

Essentially, issuing an Index Seek on a single row 
takes the same amount of time regardless of where 
the row physically resides in the database file on disk. 
Therefore, you'll see little or no difference in normal 
OLTP operations if the rows are next to each other on 
disk or scattered in various physical locations. However, 
if you're performing lots of full or partial index scans, 
the amount of fragmentation can play a much larger 
role in how efficiently the storage engine can read that 
data from disk. 


Re-Indexing Redux 

Sure, you can attempt to re-index everything each 
night. But re-indexing can also have a dramatic impact 
on performance. In fact, re-indexing can be one of the 
most resource-intensive and potentially intrusive opera- 
tions that you can perform. These operations 
can monopolize your processors, max out 
your disk I/O subsystem, and block users 
from accessing tables. They can increase the 
time and effort necessary to back up your logs, and they 
can really put a damper on log-shipped systems—even 
when the re-indexing occurs during off hours. 

So, let's be smart about this. Let's tackle only 
the indexes that need attention. Rebuilding a non- 
fragmented index only wastes resources and potentially 
hampers other users or activities. A quick Internet 
search will reveal many examples of how to check for 
fragmentation levels, then determine which indexes to 
address and how. The stored procedure that Listing 1, 
page 32, shows—rebuild_indexes_by_db—goes beyond 
most of those examples to account for other concerns 
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Pil talk more about this 
article's stored procedure 
in our Performance Tuning 
and Optimization forum. 
Check it out! 


REBUILDING INDEXES 


LISTING I: Ouery to Defragment Only the Indexes That Need It 


CREATE PROCEDURE [dbo].[rebuild indexes by db] 


= 


t 
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ApexSQL Enforce 


Rapid, repeatable and impartial 
Database reviews 


vw“ Enforce company standards and conventions 

V Ensure databases meet industry “Best Practices” 

vw“ Produce detailed reports of violations and results in HTML 
y“ Schedule nightly, unattended reviews via CLI 

Y Use our large database of rules or write your own 

VW“ Detect and fix database problems automatically 

y“ Set customizable thresholds and tolerances 


V Edit rules in a powerful IDE with Intelliprompt 


@DBName NVARCHAR (128) 
@ReorgLimit TINYINT = 15 


-- Name of the db 


Sort in tempdb option 


Always do LOB compaction 
Only do Clustered indexes 


I SeRPRPRRBR 


such as rebuilding indexes online, rebuilding by parti- 
tion, utilizing a MAXDOP setting, excluding specific 
tables, and more. SQL Server 2005 has made it very 
complicated to perform online index rebuilds for all but 
the simplest of tables, but this stored procedure makes 
it a little easier. 

The primary goal of this stored 
procedure is to loop through all 
of a given database’s indexes and 
decide which indexes to either 
rebuild or reorganize, given the 
thresholds you specify. By default, 
it ignores any indexes with less 
than 15 percent fragmentation, 


Fully customizable, 
rule-based, Database 
Standards enforcement 


Žž 


-- Minimum fragmentation % to use Reorg method 

@RebuildLimit TINYINT = 39 -- Minimum fragmentation % to use Rebuild method 
@PageLimit SMALLINT = 18 -- Minimum + of Pages before you worry about it 
@SortInTempdb TINYINT = ð -- 
@OnLine TINYINT = Y -- 
@ByPartition TINYINT =1 -- 
@LOBCompaction TINYINT = 1 -- 
@DoCIOnly TINYINT = Y -- 
@UpdateStats TINYINT =1 -- 
@MaxDOP TINYINT = 0 -- 
@ExcludedTables NVARCHAR (MAX 


Online Rebuild, Reorg is ignored 
Treat each partition separately 


Update the statistics on Reorgs 
Default so omit this from the statement 
-- Comma delimited Tist of tables (DB.Schema.Table) to exclude from processing 


reorganizes indexes that have 15 to 30 percent frag- 
mentation, and rebuilds indexes that are more than 30 
percent fragmented. Although Pve accounted for most 
common problems and options in Listing 1's param- 
eters, you might have to customize the stored procedure 
to suit the needs of more advanced implementations. 


Forward to the Forum 

Pll talk more about the rebuild_indexes_by_db 

stored procedure in SOL Server Magazines Perfor- 

mance Tuning and Optimization forum (sqlforums 

„windowsitpro.com/web/forum/). In the meantime, 

happy re-indexing! SQL] 
InstantDoc ID 99019 
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ne reason why DBAs often use relational 


(i.e., related) databases is that they avoid 


O data duplication and hence save disk space. 


In relational databases, you use identifiers for items 
when possible. That way, you need to record the items’ 
textual descriptions in only one table. When you need 
to refer to an item in another table, you simply specify 
that item’s identifier. For example, suppose you have 
a database of movie data. In the database, you have a 
table named Genre that lists genre textual descriptions 
(e.g., Comedy, Documentary) and their identifiers 
(e.g., 1, 2). You now want to include a table similar to 
the MovieReview table in Lesson 4. The MovieReview 
table includes a column named Genre, which contains 
genre textual descriptions. Rather than list the genre 
textual descriptions, you can use the genre identifiers 
provided in the Genres table. When you need to display 
data from the MovieReview table with the genre textual 
descriptions in the Genre table, you can use a JOIN 
clause to join the tables. 

JOIN clauses, which were first incorporated into 
SQL Server 7.0, let you build queries that retrieve data 
from two related tables. Prior to SQL Server 7.0, you 
had to join tables using a technique that consisted of 
listing the tables in a FROM clause and joining them 
in a WHERE clause that used special notations to rep- 
resent different types of joins. SQL Server 2005 still lets 
you use the old join technique. However, future versions 
of SQL Server might not support it. Thus, you should 
learn how to use the JOIN clauses, no matter whether 
you're first learning how to join tables or have joined 
tables for years using the old technique. 

There are five types of JOIN clauses available in 
T-SQL: 

+ INNER JOIN 

+ LEFT OUTER JOIN 

+ RIGHT OUTER JOIN 
+ FULL OUTER JOIN 

+ CROSS JOIN 
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101, 
RO 


How to join tables 


At first glance, these names might seem confusing. To 
help you understand the differences and which type 
of join is suitable for which type of queries, Pll walk 
you each type of join. But first, you need to create and 
populate two tables so that you can follow along with 
the examples. 


The Prerequisites 

The examples in this lesson rely on objects in the MyDB 

database, which was created in Lesson 2. If you didn’t 

previously create the MyDB database, you'll need to do 
so. Go to “T-SQL 101, Lesson 2” (April 2008, InstantDoc 

ID 98105) to get the code and instructions. 

You also need to create two new tables: 

Movie and Genre. You can create and popu- 

late these tables by following these steps: 

1. Download the CodeToCreateMovie&- 
GenreTables.sql and CodeToPopulate- 
Movie&GenreTables.sql files. Go to www.sqlmag 
.com, enter 99137 in the InstantDoc ID text box, 
click Go, then click the 99137.zip hotlink. 

. Create the Movie and Genre tables. Open 
SQL Server 2005’s SQL Server Manage- 
ment Studio (SSMS) or SQL Server 2000's 
Query Analyzer and copy the code in 
CodeToCreateMovie&GenreTables.sq] into the 
query window. Execute the code. 

. Populate the Movie and Genre tables by executing 
the code in CodeToPopulateMovie&GenreTables 
.sql in SSMS or Query Analyzer. 
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Figure I 


Comparing the tables” 
contents reveals that 
each table has a 
GenrelD column 
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Documentary 
Drama 
Fiction 
Horror 
Sci-Fi 
Thriller 


GenreID Movie 


Meet the Clusters 
Defragger Hill 


Bits, Bytes, Videotape 


Chariots of Firewire 
V for Vendor 
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To Kill a Mocking Nerd 


The INNER JOIN Clause 

An inner join is the most common type of join between 
two tables. A column or columns in the first table are 
compared to a column or columns in the second table 
using a comparison operator such as the equals (=) 
operator. Whenever a match is found, the rows from the 
first table are connected to the rows in the second table. 
Unmatched rows are discarded. The structure you end 


Genre Movie 
Comedy Meet the Clusters 
Documentary Defragger Hill 
Documentary When Harry Re-Indexed Sally's Table 
Drama Bits, Bytes, Videotape 
Drama Bridge Over The River Motherboard 
Figure 2 Fiction The Perfect Requirements Document 
Horror Planet Of The Ape-Like DBA's 
Horror The User who Knew Too Much 
Results from Sci-Fi Chariots of Firewire 
using an inner Sci-Fi SCSI Terminator: Connection Day 

a li Il ch Sci-Fi The Day the Hard Drive Stood Still 

Join to list all the | sci-fi v for Vendor 

movies sorted by 


their genres 


up with after the join is almost a new table unto itself. It 
will contain the selected columns from both tables. For 
example, if you have Tablel, which has ColumnA and 
ColumnB, and Table2, which has ColumnC, an inner 
join will produce a structure with ColumnA, ColumnB, 
and ColumnC. Whether the individual rows contain data 
depends on the result of the join. 

Let's explore inner joins further with the Genre and 
Movie tables. To begin, execute the code 


SELECT * FROM Genre 
SELECT * FROM Movie 


Chariots of Firewire 
SCSI Terminator: Connection Day 


The Day the Hard Drive Stood Still 
v for Vendor 


when Harry Re-Indexed Sally's Table 


Bridge Over The River Motherboard 
The Perfect Requirements Document 
Planet Of The Ape-Like DBA's 

The User Who Knew Too Much 


SCSI Terminator: Connection Day 
The Day the Hard Drive Stood Still 


Figure 3 


Results from using an inner join to list all the movies in the Sci-Fi genre 
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As Figure 1 shows, both tables have 
a GenreID column, which makes 
them an excellent join candidate. 
Each GenrelD in the Movie table 
points to the appropriate genre 
record in the Genre table. (Note 
that there is one GenrelD—10—in 
the Movie table that doesn't have a 
corresponding record in the Genre 
table. This is intentional and will 
be used for demonstration pur- 
poses in “The OUTER JOIN 
Clauses” section.) 

To create a list of movies sorted 
by their genres, you can use the 
INNER JOIN clause to join the 
Movie table to the Genre table in 
code such as 


SELECT g.Genre, m.Movie 
FROM Movie m 
INNER JOIN Genre g 
ON g.GenreID = m.GenrelD 
ORDER BY g.Genre, m.Movie 


Let's walk through this code line by line. The first 
line selects the columns you want to see in the result 
set, which in this case are the Genre column from the 
Genre table and the Movie column from the Movie 
table. Notice that the column names are prefixed with 
the letters g and m. These are table aliases. The alias for 
each table is specified in the FROM clause immediately 
following the table name and usually consists of a 
shortened version of the table name. In this case, m is 
the alias for Movie and g is the alias for Genre. 

Aliases are used in this query because both tables 
have columns named GenreID. Alternatively, you 
could use the TableName.ColumnName naming con- 
vention (e.g., Movie.GenreID, Genre.GenreID) when 
specifying the columns, but I find this approach a bit 
long winded, especially when referencing tables with 
relatively long names. It really comes down to personal 
preference, so feel free to use either method. 

For inner joins (and outer joins), tables participating 
in the join are listed in the FROM clause. In this case, 
the Genre table and the Movie table are being joined. 
In simple inner-join queries, the order in which you list 
the tables doesn’t matter. For example, the code 


SELECT g.Genre, m.Movie 
FROM Genre g 
INNER JOIN Movie m 
ON g.GenreID = m.GenreID 
ORDER BY g.Genre, m.Movie 


would provide the same results. That’s not the case in 
some other types of joins, as you'll see shortly. 
The ON keyword is used to indicate which columns 
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should be matched together. In 
this case, were matching the 
GenreID columns from both 
tables. Other column compari- 
sons can be added after the 
ON keyword by using the AND 
keyword. 

Finally, the ORDER BY 
clause is used to sort the results 
alphabetically first by genre, then 
by movie. Figure 2 shows the results. 

Now suppose you want the result set to show only 
those movies in the Sci-Fi genre, as Figure 3 shows. 
This can be accomplished by using a WHERE clause 
to add a filter: 


Genre.GenreID 


a 
2 
2 
3 
3 
4 
5 
5 
6 
6 
6 
6 
7 


SELECT g.Genre, m.Movie 
FROM Movie m 
INNER JOIN Genre g 
ON g.GenreID = m.GenreID 
WHERE g.Genre = ‘Sci-Fi’ 
ORDER BY m.Movie 


As you can see, the INNER JOIN syntax clearly sepa- 
rates the joining criterion from the filtering criterion. In 
all five types of joins, the WHERE clause filters data 
from the result set after the join has already taken place. 
This is a key difference between the JOIN clauses used 
in SQL Server 7.0 and later and the old join technique 
used in SQL Server 6.5 and earlier. In the old join 
technique, both the join and filter criteria are in the 
WHERE clause. 


The OUTER JOIN Clauses 

There are three types of outer joins: left outer join, 
right outer join, and full outer join. When considering 
outer joins, it helps to think of two tables arranged 
side by side. The table on the left is the first table in 
the join. The table on the right is the second table in 
the join. Left outer joins return all rows from the first 
table and only rows from the second table that meet the 
join criteria. Columns from the second table that don’t 
match the criteria will contain NULL values. Right 
outer joins return all rows from the second table and 
only rows from the first table that meet the join criteria. 
Columns in the first table that don’t match the criteria 
contain NULL values. Full outer joins return all rows 
from both tables. Whenever a row in one table has 
no match in the other table, the columns will contain 
NULL values. 

Let’s perform a left outer join on the Genre and 
Movie tables to see whether there are any genres that 
don’t have a matching movie. In the following query, 
Genre is the left table and Movie is the right table: 


SELECT g.GenreID AS ‘Genre.GenreID’ , 
g.Genre, m.MovielD, 
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Genre MovieID Movie.GenreID Movie 


Comedy 1 Meet the Clusters 


Documentary 
Documentary 
Drama 

Drama 
Fiction 
Horror 
Horror 
Sci-Fi 
Sci-Fi 
Sci-Fi 
Sci-Fi 
Thriller 


Defragger Hill 


V for Vendor 
NULL 
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m.GenreID AS “Movie.GenrelD”, 
m.Movie 
FROM Genre g 
LEFT OUTER JOIN Movie m 
ON m.GenreID = g.GenrelD 
ORDER BY g.Genre, m.Movie 


As this code shows, the OUTER JOIN syntax is 
similar to the INNER JOIN syntax, except that it 
matters which table gets listed first in the FROM 
clause. The Genre table must be listed first because it's 
the left table. 

Figure 4 shows the results of this left outer join. 
As you can see, the columns from the Genre table for 
GenreID 7 contain actual data, whereas the columns 
from the Movie table contain NULL values. Thus, 
there is a GenreID 7 in the Genre table but not in the 
Movie table. 

Now let's perform a right outer join on the Genre 
and Movie tables to see whether there are any movies 
that don't have a matching genre. In this query, Genre 
is still the left table and Movie is still the right table: 


SELECT g.GenreID AS “Genre.GenrelD”, 
g.Genre, m.MovielD, 
m.GenreID AS “Movie.GenrelD”, 
m.Movie 
FROM Genre g 
RIGHT OUTER JOIN Movie m 
ON m.GenreID = g.GenreID 
ORDER BY g.Genre, m.Movie 


As Figure 5, page 36, shows, in the first row, the Genre 
table's columns contain NULL values, whereas the 
Movie table's columns contain data. So, there is a Gen- 
reID 10 in the Movie table but not in the Genre table. 
Finally, let's perform a full outer join on these tables 
to identify records in each table that have no corre- 
sponding record in the other table. When performing a 
full outer join, the concept of left or right table doesn't 
apply, so the table order doesn't matter. The query 


SELECT g.GenreID AS “Genre.GenrelD”, 
g.Genre, m.MovielD, 
m.GenreID AS “Movie.GenrelD”, 
m.Movie 
FROM Genre g 
FULL OUTER JOIN Movie m 
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Results from using an 
outer left join to list 
the genres with no 
matching movies 
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Figure 5 


Results from using an 
outer right join to list 
the movies with no 
matching genres 
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Figure 6 


Results from using 
a full outer join to 
find invalid data 
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ON m.GenreID = g.GenreID 
ORDER BY g.Genre, m.Movie 


produces the results in Figure 6. If you look at the 
results closely, you'll see all the movies with matching 
genres, the genres with no matching movies, and 
the movies with no matching genres. In essence, the 


MovieID Movie.GenreID Movie 


To Kill a Mocking Nerd 


Meet the Clusters 
ary Defragger Hill 


ary 


V for Vendor 
NULL 
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full outer join gives you the results of an inner join, 
left outer join, and right outer join in one handy 
package. 

Full outer joins are especially useful for tracking 
down data integrity problems. For example, from 
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the results in Figure 6, you can 
see that “To Kill a Mocking 
Nerd” in the Movie table has 
an invalid GenrelD. If you 
change this movie’s Genre- 
ID from 10 to 7 in the Movie 
table and rerun the full outer 
join query, this data anomoly 
disappears. 


SCSI Terminator: Connection Day 
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The CROSS JOIN Clause 

A cross join produces a Cartesian product of two tables, 
which means that every row in the first table is joined to 
every row in the second table. You can read more about 
Cartesian products (and the mathematics behind them) 
at en.wikipedia.org/wiki/Cartesian_product. 

Because of the exponential nature of the results, the 
cross join is commonly used to generate large volumes of 
test data from relatively small 
tables. For example, the Genre 
table has only 7 rows and the 
Movie table has only 13 rows, 
but an unfiltered cross join 
would produce a result set 
containing 91 rows. 

The cross join is also 
commonly used because it’s 
easy to implement. You don’t 
need to include the ON key- 
word in cross joins, and the 
order in which the tables are listed doesn’t matter. 
For example, let’s perform a cross join of the Movie 
and Genre tables. To limit the result set, let’s include a 
WHERE clause so that only comedies and thrillers are 
included. The query would look like 


SELECT g.Genre, m.Movie 
FROM Movie m 
CROSS JOIN Genre g 
WHERE Genre = ‘Comedy’ 
OR Genre = ‘Thriller’ 
ORDER BY Genre, Movie 


As Figure 7 shows, every Genre record is connected to 
every Movie record. 


JOIN In on the Fun 
As you can see, it’s easy to use the INNER JOIN, 
LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL 
OUTER JOIN, and CROSS JOIN clauses. Now that 
you know how to perform inner joins, outer joins, and 
cross joins, I hope that you'll use your newfound join 
skills to create some sample reports. Just try to have 
fun and keep the cross joins to a minimum, especially 
when dealing with tables with thousands of rows or 
more. (Your DBA will appreciate it.) [SQL 
InstantDoc ID 99137 
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IMANY-to-MANM 
Relationships 
in the Data Warehouse 


Designing your data warehouse 
to support market-basket analysis 


ust like any well-designed database, a data ware- 

house contains relationships. From everything 

that you've read in this series over the past year, 
you might be tempted to think that all relationships 
in a data warehouse are one-to-many (1:M), from the 
dimension to the fact table. However, that’s not true. 
Occasionally, a data warehouse requires a many-to- 
many (M:N) relationship. 

You can successfully argue that the star schema is 
inherently a M:N relationship. So why are we having 
this discussion? Competition is forcing businesses to 
become more innovative, and this innovation can bring 
with it more products and services in combinations that 
might not have been considered before. This innovation 
also maps to increasingly complicated business models, 
and subsequently, to more complex data warehouse 
designs. A business intelligence (BI) analyst who knows 
how to evaluate customer buying behavior can extract 
information from the data warehouse to determine 
which combinations of products and services will 
entice a customer to make a purchase. 

One technique the BI analysts use is called market- 
basket analysis, which is the process of evaluating col- 
lections of items in the data warehouse to determine 
whether they’re somehow related. Market-basket 
analysis is frequently (but not exclusively) used in the 
retail sector. Information gleaned from market-basket 
analysis helps retailers understand buyers’ needs. 
Based on this analysis, retailers can better appeal to 
buyers by doing things such as reorganizing a store 
layout. For example, the idea of co-locating beer and 
baby diapers—so that when Dad runs to the market 
to pick up baby diapers, he also buys beer—was a 
finding that came out of early market-basket analysis 
efforts. Understanding the buyers’ mind-set can enable 
retailers to develop cross-promotional marketing pro- 
grams, capture new buyers, trim inventories down to 
the products that sell best, and design intelligent sales 
events. As a data warehouse designer and DBA, you 
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need to know how to model your database to support 
market-basket analysis so that your BI analysts can 
extract the information your business needs to beat 
the competition. 

A good example for this discussion is a build-to- 
order checking account—a bank account in which 
customers select the features they want from a menu, 
essentially customizing the account and deciding how 
much they will pay per month in bank fees. Some fea- 
tures are free and some are not. The free features might 
include a check card with reward points (choose your 
favorite rewards program), online banking, bill-pay 
service, email alerts, no minimum balance, no direct 
deposit required, and identity theft insurance. Premium 
services might include no fees for using another bank's 
ATM, rebates on ATM fees if you do use another 
bank's services, interest-earning accounts, double 
rewards points, money back or extra rewards points on 
check-card purchases, anniversary cash bonuses, and 
overdraft fee forgiveness. Customers choose some or 
all of the free features and one or two of the premium 
features; these features are bundled into a customized 
checking account. Then customers can choose as many 
additional premium features to add to the custom 
account as they want for a fee of say, $2 per feature 
per month. 


The OLTP Database Schema 
If you were designing the OLTP database schema for a 
build-to-order checking account, it would look similar 
to Figure 1 (page 38). This schema stores each account 
configuration, with general account information stored 
in the Account table and the features for each account 
stored in the AccountFeatures fact table. What's not 
included in this figure are tables to capture the many 
types of transactions associated with an account, such 
as check-card purchases, purchases made with paper 
checks, bill-pay records, and ATM withdrawals. 

Let’s assume that your datasource looks similar 
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to Figure 1, with Account and AccountDetail in a 
1:M relationship. Now your company's CFO and vice 
president of marketing want to know which customers” 
accounts are using which features most often and in 
which combinations, if any. They want to measure the 
efficacy of the free features versus the premium features 
and determine which features sell the best, so they can 
better direct the bank’s marketing efforts. They also 
want to better understand their customers” buying 
behavior. In other words, the CFO and vice president 
of marketing want to do market-basket analysis. Why 
does a customer pay for a purchase one time with a 
check, but pays for his or her next purchase with a 
check card? What are the circumstances surrounding 
these varying types of buying behavior? If the cost to 
process a paper check is a set fee per check, but the 
cost to process a check card transaction is a percentage 
of the total transaction amount, it might behoove the 
bank to figure out a way to encourage its customers— 
especially the big spenders—to pay with checks. 


The Star Schema 


To create the star schema that will help determine 


the answers to these ques- 
tions, you'll need a Transac- 
tion fact table, a dimension 
for the Account, and an 
AccountFeatures fact table. 
In addition, you'll need cat- 
egorical dimensions for the 
features, account category 
(e.g., free, premium), and 
the rewards program (e.g., 
airline, hotel, rental car). 
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answer the CFO’s questions, 
a transaction needs to link to 
one or more features in the 
AccountFeatures fact table, 
and each of the Account- 
Feature records needs to link 
back to its parent account. 
This complexity mandates 
a M:N relationship between 
the Account dimension and 
the Transaction fact table, as 
shown in Figure 2. In addi- 
tion, there's a conventional 
1:M relationship between 
the Account dimension and 
the Transaction fact table. 

This schema is multi- 
functional because you can 
use it for different types 
of analyses. The Account, 
Location, Customer, and Time dimensions are all 
conformed, meaning they can participate in multiple 
star schemas. Taken as a unit, the Account, Customer, 
Location, and Time dimensions and the Transaction 
fact table comprise a standard star schema model, 
which enables BI analysts to create cubes from these 
tables and extract information without regard to 
which account feature was used to generate the trans- 
action. The M:N relationship between the Account 
dimension and the Transaction fact table, which is 
manifested in AccountFeatures, is the structure that 
will enable market-basket analysis without having 
to create a separate star schema. By linking each 
transaction with the account feature that was used 
to generate the transaction, BI analysts can begin to 
find answers to the questions asked about customer 
buying behavior. 

By not creating two separate star schemas (i.e., 
a Transaction star schema and an AccountFeatures 
star schema), you're saving disk storage. If you were 
to combine the Account and AccountFeatures tables 
into a single dimension, that dimension would swell by 
a factor of 15—more if additional features are added 
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MANY-TO-MANY RELATIONSHIPS 


to the build-to-order checking account program. In 
previous articles, such as “Data Warehousing: Junk 
Dimensions,” May 2008, InstantDoc ID 98356, I 
have discussed the desirability of keeping dimensions 
tight and concise; adding features to the Account 
dimension would create a large, unwieldy dimension 
that wouldn't function well when being cubed. Denor- 


more innovative, offering product and service com- 
binations that are derived from complex analysis of 
customers’ buying behaviors. You'll need to be able to 
support market-basket efforts with your data ware- 
house design; understanding how to do so will make 
you a hero. SQL] 

InstantDoc ID 98921 


malizing the Transaction fact table by incorporating 
account feature attributes wouldn't be the best idea, 
either. The fact table would contain the most records 
of all the tables in the schema; increasing its size and 
adding more keys would diminish performance and 
cause locking and blocking conflicts when the con- 
ventional BI analyst and the market-basket BI analyst 
run their programs simultaneously. 
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For more information about data warehousing: 

“Discover the Star Schema,” InstantDoc ID 96112 

“Building Your Data Warehouse,” InstantDoc ID 
96313 

“Data Warehousing: Measuring the Facts," 
InstantDoc ID 96336 

“Data Warehousing: Dimension Basics," 
InstantDoc ID 96846 

“The Data Warehouse Bus Architecture," 
InstantDoc ID 96926 

“Data Warehousing: Dimensional Role-Playing,” 
InstantDoc ID 97272 

“Data Warehousing: Slowly Changing Dimensions,” 
InstantDoc ID 97409 

“Data Warehousing: Rapidly Changing Monster 
Dimensions,” InstantDoc ID 97653 

“Data Warehousing: Junk Dimensions,” 
InstantDoc ID 98356 

“Data Warehousing: Degenerate Dimensions,” 
InstantDoc ID 98722 

“Metamodel for Retail Sales,” InstantDoc 
ID 20409 


Beyond Retail 

Market-basket analysis is an important operation that 
has uses that go beyond the retail sector. For example, 
you can discover relationships in a medical database, 
such as patients who had procedure X subsequently 
developing condition Y within n days/weeks/months. 
In a law enforcement or social profiling database, you 
might discover that people who engage in behavior 
A by age n have a d percent chance of engaging in 
behavior B for the rest of their life. 


Market-Basket Hero 

Market-basket/affinity analysis is predictive: It 
attempts to forecast future behavior based on past 
actions. Business is getting more and more competi- 
tive; in response to this pressure, they are becoming 
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Virtualization Shootout: 
Hyper-V vs. ESX Server 3.5 


These products deliver excellent performance and 
make it easy to virtualize your SOL Server instances 


irtualization is steadily making its way into the 

SOL Server world for use in server consolida- 
tion scenarios as well as for development and testing. 
VMware's ESX Server 3.5 is the established leader in 
the server virtualization market. However, Windows 
Server 2008’s Hyper-V has an architecture that mounts 
a serious challenge to ESX Server 3.5. 

In this article, I compare Hyper-V to ESX Server 
3.5. I explore the architectural differences between the 
two products and compare the products’ overall feature 
sets. Next, I discuss my management and setup experi- 
ences. Then I compare the products’ performance. For 
Hyper-V, a newcomer to the virtualization market, 
performance is a real question mark, whereas ESX 
Server 3.5 has proven that it can deliver enterprise-class 
performance. Can Hyper-V meet the virtualization 
standards set by ESX Server 3.5? To find out, I ran 
basic performance tests on both platforms with some 
revealing results. All of the testing in this article was 
performed using the beta version of the Hyper-V code 
that was shipped with Server 2008. 


Thick vs. Thin Hypervisor 

ESX Server 3.5 and Hyper-V both utilize hypervisor- 
based architectures. Hypervisor-based virtualization 
performs better than older, hosted virtualization 
technologies. Hosted virtualization products, such as 
Microsoft Virtual Server 2005 R2, run the virtualiza- 
tion software on top of the host OS, which adds sig- 
nificant overhead and a longer code execution path for 
the virtual machines (VMs) that run in the hosted vir- 
tualization environment. In contrast, hypervisor-based 
products, such as ESX Server 3.5 and Hyper-V, are 
designed to run the hypervisor directly on the system 
hardware. There's no OS between the hypervisor and 
the system hardware. Although ESX Server 3.5 and 
Hyper-V both share a similar hypervisor-based archi- 
tecture, there are significant differences in the way the 
products are designed, as you can see in Web Figure 1 
(www.sqlmag.com, InstantDoc ID 99218). 

One of the biggest differences between ESX Server 
3.5’s hypervisor and Hyper-V's hypervisor is the way 
they handle device drivers. With ESX Server 3.5, the 
hardware drivers are part of the hypervisor, which 
increases the size of the hypervisor and limits the 
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hardware that ESX Server 3.5 can run on. That said, 
ESX Server 3.5 is supported on most of the server 
systems made by all the tier-one vendors such as HP, 
Dell, and IBM. These vendors also sell system configu- 
rations with ESX Server 3.5 preloaded. To see a list of 
ESX Server-compatible systems, go to www.vmware 
„com/pdf/vi35 systems, guide.pdf. 

In contrast, Hyper-V uses a microkernel hypervisor 
that doesn't contain any device drives. The hypervisor 
contains the minimum amount of code reguired to 
schedule and share hardware resources between the 
active VMs. This architecture ensures that the hyper- 
visor has the best possible performance and reduces 
the potential attack surface of the hypervisor. Hyper-V 
leverages the native Windows device-driver model uti- 
lizing the device drivers in the guest VMs. Hyper-V also 
includes a new high-performance VM archi- 
tecture that enables Server 2008, Windows 
Vista, Winders Server 2003, and Xen-enabled 
Linux to pass hardware requests along a new 
memory pipeline called the VMBus. 

Both products are managed using the first VM 
partition. ESX Server 3.5's VM partition is based on a 
Linux shell and is command-line oriented. In addition, 
ESX Server includes an easy-to-use Windows-based 
management console called the Virtual Infrastructure 
Client, which can be downloaded from ESX Server 
3.5’s Web console. Likewise, Hyper-V is managed using 
the VM running in the first partition. For Hyper-V, 
this partition is called the parent partition. Other VMs 
run in child partitions. 
Hyper-V’s parent parti- 
tion is also used to run 
VMs with legacy OSs, 
such as Windows 2000 
and Windows NT, that 
can't utilize Hyper-V’s 
VMBus architecture 
and must run using the 
older hardware model. 
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new architecture and 64-bit host system bring its fea- 
ture set into parity with ESX Server 3.5. You can see a 
comparison of the products’ features in Web Table 1. 

The primary differences between the products 
begin with the hypervisor itself: As I said earlier, ESX 
Server 3.5 includes a hypervisor that contains device 
drivers. In contrast, Hyper-V includes a hypervisor 
that contains no device drivers. Both products provide 
support for 32-bit x86 and 64-bit x64 guest OSs and 
large VMs with as much as 64GB of RAM per VM. 
Booting VMs from either an iSCSI or Fibre Channel 
SAN is also supported by both products. For Hyper-V, 
the physical host server supports only hardware-based 
virtualization (e.g., Intel-VT or AMD-V), and virtual- 
ization must be enabled in the BIOS. 

One area in which ESX Server 3.5 excels is in sup- 
port for Live Migration (i.e., moving running VMs 
from one host to another). However, this feature 
requires VMware's VirtualCenter, which comes at an 
additional cost. VirtualCenter is included with the 
VMware Infrastructure Enterprise product. VMware's 
Live Migration capability (called VMotion) also 
requires a SAN. Hyper-V doesn’t support Live Migra- 
tion, but when combined with failover clustering 
Hyper-V does support what Microsoft calls Quick 
Migration (i.e., saving the state of a running VM 
and moving that VM to another host where it can be 
quickly restarted). ESX Server 3.5 is limited to 128 
active VMs (which is probably more than enough for 
most people), and Hyper-V is limited only by the avail- 
able system resources. Unlike desktop virtualization 
products, neither product provides support for guest 
audio or USB. 

Overall, the feature set provided by each product 
is comparable. ESX Server 3.5, in combination with 
VirtualCenter, offers Live Migration, but Hyper-V 
supports larger host systems and more active VMs. 
Both products support 64-bit guests with large memory 
addressability. 


Setup and Management 

Setting up both systems was relatively easy. 
However, setting up ESX Server 3.5 was much 
faster and easier than the Hyper-V installation. 
Although the ESX Server 3.5 installation was 
character-based, the screens were easy to follow, 
and I had a functional server running in about 
20 minutes. The setup itself prompted for all the 
necessary networking and configuration informa- 
tion and there was no need to reboot the server. 
In fact, I never needed to reboot the ESX Server 
system during the entire testing process, which is 
a statement I can’t make about Server 2008 and 
Hyper-V. Although ESX Server 3.5 is natively 
managed using a Linux-based command shell, 
I never really needed to deal with the command 
shell during my tests. The Virtual Infrastructure 
Client, which is shown in Figure 1, enabled me to create 
and manage all the VMs that were required. 

I downloaded and installed the Virtual Infrastruc- 
ture Client by pointing my browser to ESX Server 3.5’s 
IP address. The Virtual Infrastructure Client manage- 
ment console connected and worked right away with 
no hassles. I found its management layout to be very 
professional and usable. It provides both VM manage- 
ment and host performance information. Although 
the management console doesn’t let you perform many 
server reconfiguration functions, in my testing, no 
server reconfigurations were required. 

I had a good deal of trouble getting Hyper-V set 
up and running. However, Pm sure that much of my 
trouble occurred because I was using a beta version of 
Hyper-V. To perform the SQL Server testing, I originally 
attempted to install Hyper-V using Server Core but 
found the process hamstrung by Hyper-V’s nonfunc- 
tional remote management. (Remote management is 
mandatory for Server Core, which doesn’t provide a 
graphical interface.) I then tried John Howard’s (a senior 
program manager for Hyper-V) 17-step process for get- 
ting Hyper-V remote management to work, which can be 
found at blogs.technet.com/jhoward/archive/2008/03/30/ 
part-3-hyper-v-remote-management-you-do-not-have- 
the-requested-permission-to-complete-this-task-con- 
tact-the-administrator-of-the-authorization-policy- 


for-the-computer-computername.aspx. However, this 


process didn’t work for me. This convoluted installation 
process could be a real showstopper for Hyper-V and 
will obviously have to be fixed before the final release of 
the feature. Running Hyper-V on Server Core would be 
advantageous because of its lower overhead and reduced 
attack surface. The fact that I couldn’t make Hyper-V 
work on Server Core really made me appreciate ESX 
Server 3.5’s simple and functional delivery of the man- 
agement client. 

Due to time constraints, I gave up on the Server 
Core installation and ran Hyper-V on a full Server 2008 
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installation using the local console for management. 

After installing Server 2008, I installed the Hyper-V 

role using Server 2008's Server Manager. Not counting 

the failed Server Core installation attempt, it took three 
reboots before the server was fully set up. 

Figure 2 shows Hyper-V's management console. 
Theoretically, multiple Hyper-V server instances can 
be managed in the left pane of Hyper-V Manager, 
although I couldn't connect to any remote servers. 
Selecting a server instance displays that server’s VMs in 
the Virtual Machines pane. You can then manage the 
VMs by right-clicking them and selecting 
options from the context menu. 1 found 
Hyper-V Manager to be less functional 
and less well thought out than ESX Server 
3.5' management console. I thought that 
the Snapshot pane in particular was poorly 
positioned, and it seemed that everything 
was forced to fit inside the generic Microsoft 
Management Console (MMC) 3.0 frame- 
work. Although Hyper-V Manager doesn't 
show detailed server performance informa- 
tion, basic VM management using the man- 
agement console was functional and easy. 
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road. To compare the performance of ESX 
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Server 3.5 and Hyper-V, I created four VMs, 
each of which was running Server 2008 
Enterprise Edition configured with 512MB of RAM 
running SQL Server 2005 Enterprise Edition SP2. 
For each platform, I used the default virtual hard disk 
configuration options. Four Server 2008 instances can 
be simultaneously active as VM guests with no addi- 
tional Server 2008 licensing costs. (Pl discuss licensing 
in more detail later.) 

All of my testing was conducted on an HP Pro- 
Liant ML370 G5, which is a rack-mounted 4U server. 
My test unit included two Quad-Core Intel Xeon 
processors running at 1.86 GHz on a 1066 MHz front- 
side bus. The ProLiant ML370 G5 came equipped with 
8GB of RAM and eight 72GB 15,000 RPM drives 
configured as a RAID array. 

For the test workload I used 27 different queries 
running against the sample AdventureWorks database. 
Although the bulk of the workload was data retrieval, 
the batch also contained a couple of computational 
loops and four SELECT INTO statements to add 
some computational and data modification operations. 
A think time of three seconds was inserted between 
each database interaction. 

The workload was executed from four physical 
client machines, each of which used Microsoft's sqlemd 
utility to launch the workload against all four SQL 
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Server systems running as VMs. Then the client paused 
for three seconds. This process was repeated 10 times, 
creating a total of 40 running jobs from each of the 
four clients where each client was running 10 separate 
connections per server. This test revealed how each 
of the virtualization platforms was able to respond to 
increasing workloads and to the total workload with 
all four virtualized SQL Servers actively servicing 40 
queries. The final performance measurement that I 
used was the average time required to complete the 
measured workloads. The tests were run multiple times 
and the results were averaged. Web Figure 2 compares 
the overall performance of the two platforms. 

As you can see in Web Figure 2, the results were 
surprisingly close, with ESX Server 3.5 providing better 
performance under the tested workload. Although 
ESX Server 3.5 edged out Hyper-V by 4 percent in the 
performance tests I ran, Hyper-V made a competitive 
showing in spite of the fact that it was prerelease code. 
I expect the final release will provide very similar per- 
formance results. 1 also expect that the glaring remote 
management problems will be corrected in the final 
release. However, I don't expect Hyper-V's manage- 
ment console to be improved to match ESX Server 3.5's 
Virtual Infrastructure Client. You should bear in mind 


Figure 2 


Hyper-V's management 


console 
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thatalthough these results were representative of the test 
workload that I ran, it's possible that different workloads 
and environments could produce varied results. 


Virtualization and Licensing 

Price is another area in which there's a significant dif- 
ference between ESX Server 3.5 and Hyper-V. ESX 
Server 3.5 has always been a chargeable product and 
is the staple of VMware's product line. In my com- 
parison of each of the product’s licensing and pricing 


structures, I used VMware Infrastructure Foundation 
3, which includes the following components: 

+ ESX Server 3.5 

e VMware Virtual Machine File system (VMES) 

e VMware SMP 

+ VirtualCenter Agent 

+ VMware Consolidated Backup 

e VMware Update Manager 


In contrast, Hyper-V is included in Server 2008, 
making it essentially free to organizations run- 
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$2,499 bundle price! 


ning Server 2008. Because Hyper-V is a 64-bit 
technology, you must have x64 hardware and use 
one of the x64 editions of Server 2008. Hyper-V 
is included in Server 2008 Standard Edition x64, 
Server 2008 Enterprise Edition x64, and Server 
2008 Datacenter Edition x64 

Microsoft does make three versions of Server 
2008 that don't include Hyper-V (aptly named Server 
2008 without Hyper-V), and the price difference is 
negligible at only $28. Hyper-V also isn't included in 
Windows Web Server 2008, Windows HPC Server 
2008, or Server 2008 for Itanium-based Systems. 

The pricing for the two tested configurations 
is shown in Web Table 2. Note that this table uses 
retail pricing, whereas most businesses use Micro- 
soft's volume licensing. 

One important point to be aware of is that Server 
2008 Enterprise Edition and Windows 2003 R2 
allow for as many as four active virtual Windows 
instances with no additional costs. This licensing 
is the same whether you use Microsoft or VMware 
virtualization products. In addition, Server 2008 
Datacenter Edition and Windows 2003 R2 allow an 
unlimited number of virtual Windows instances no 
matter which virtualization platform you're using. 
You can also run an unlimited number of SQL 
Server 2005 Enterprise Edition instances on VMs no 
matter which virtualization technology you're using. 
The end result is that the inclusion of Hyper-V with 
Server 2008 makes Hyper-V a less expensive virtual- 
ization option than ESX Server 3.5. 


The Bottom Line 
My testing revealed that both Hyper-V and ESX 
Server 3.5 deliver excellent levels of performance. 
For midsized businesses, especially those organiza- 
tions just getting into virtualization, I found Hyper-V 
to be compelling. Being bundled with Server 2008 
makes the price point very attractive and the Win- 
dows-based management tools make it easy to use. 
However, ESX Server 3.5's better performance, more 
mature management tools, and Virtual Infrastruc- 
ture 3 management suite provide a more feature-rich 
platform for large organizations. SQL] 
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DATA SCRIPTING 
r AND EXPORTING 
O hs siroce vts tais Export and Script Data in T-SQL 
2 ee Thomas Smyczek has announced SQL Scripter 
pecera 1.92, the latest version of his scripting utility for SQL 
(E) tinc amos Server that lets DBAs script their data in T-SQL 
format. SQL Scripter offers scripting options for a 
variety of data types, including binary, image, and 
varbinary. Version 1.92 fixes an identify-insert-value 
bug and adds support for scripting ntext fields (with more than 4,000 characters) and text fields (with more than 
8,000 characters). To run SQL Scripter 1.92, Microsoft NET Framework (2.0 or later) needs to be installed on your 
system. SQL Scripter 1.92 is free; you can download it at www.sqlscripter.com. 


REPORTING SERVICES 

Streamline Creation of SSAS OLAP Reports 

Simplifying the task of creating SQL Server Analysis Services (SSAS) OLAP Reports using SQL Server 2005 
Reporting Services (SSRS) is the focus of Intelligencia Query for BIDS, a new product from iT-Workplace. 
According to the vendor, the product includes a query builder and a custom data extension that lets DBAs render 
MDX queries without being limited by constraints imposed by SSRS. Pricing for Intelligencia Query for BIDS 
starts at $169 per developer and $694 per SSRS instance. For more information, contact iT-Workplace at info@ 


it-workplace.com or visit www.it-workplace.co.uk. 


APPLICATION DEVELOPMENT 
Generate .NET-Compatible Reports 

Data-analysis tool vendor Data Dynamics announced that its Data 
Dynamics Reports and ActiveReports for NET products now take 
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advantage of the integration features in Microsoft Visual Studio seen 
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2008. This improved integration gives Visual Studio 2008 developers 
a familiar work environment and a consistent report-design inter- 
face. For more information about the Visual Studio 2008 support in 
ActiveReports for .NET and Data Dynamics Reports, contact Data 
Dynamics at 614-895-3142 or visit www.datadynamics.com. 


DATABASE DEVELOPMENT 
Enforce Best Practices 

Ensuring that database development meets predefined standards is the focus of ApexSQL Enforce 2008, a new SQL 
Server utility that helps enforce best practices in database development. According to the vendor, ApexSQL, DBAs 
can use the product to identify database-design problems early in development, reinforce adherence to develop- 
ment best practices and standards, and optimize databases for portability to non-SQL Server platforms. Pricing 
for ApexSQL Enforce 2008 starts at $499 per server. For more information, contact ApexSQL at 919-968-8444 or 
866-665-5500 or visit www.apexsql.com. 


SQL SERVER MANAGEMENT 

Manage SSRS with Microsoft System Center Operations Manager 2007 

AVIcode has announced AVIcode Reporting Services Management Pack, an add-on for Microsoft System Center 

Operations Manager 2007 that monitors and troubleshoots SSRS problems. According to AVIcode, this new 

management pack can monitor and detect a variety of issues and system failures within SSRS, including faulty 

SQL statements, performance bottlenecks, and data inconsistencies. Pricing for AVIcode Reporting Services 

Management Pack starts at $3,995 per server. For more information, contact AVIcode at 443-543-0030 or visit 

www.avicode.com. SQL| 
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pi: Visual Studio (VS) 2008, you'll find 
a new edition of SOL Server called SOL 
Server Compact Edition. Formerly known as 
SOL Server Everywhere, SOL Server Compact 
Edition 3.5 is Microsoft's smallest version of SQL 
Server. Unlike other SQL Server editions, SQL 
Server Compact Edition runs in-process with 
your application. You can use it as a database 
for desktop applications or mobile applications 
and distribute it with your applications. And it's 
free—it’s part of VS 2008 and VS 2008 Express, 
or you can download it from microsoft.com/sql/ 
editions/compact/downloads.mspx. 


What OSs does it support? 

SQL Server Compact Edition supports server, 
desktop, and device OSs including various editions 
of Windows Vista, Windows Server 2003, Windows 
XP, and Windows Mobile. 


What are its system 
requirements? 

Designed for mobile applications, SQL Server Com- 
pact Edition has a very small footprint requiring 
only 5MB of RAM and 2MB of disk storage. It 
uses a single .sdf file for storage and supports data- 
bases up to 4GB. On the software side, SQL Server 
Compact Edition requires the Microsoft .NET 


Is SQL Server Compact Edition a 
full relational database? 

Yes. It provides full referential integrity, including support 
for cascading deletes and multiple column indexes. It also 
supports transactions, including the ability to commit and 
roll back database changes. However, it doesn't support 
stored procedures. Data files can be encrypted using RSA 
128-bit encryption. 


How do you manage SQL Server 
Compact Edition databases? 

You create databases using VS 2008 and SQL Server 
Compact 3.5 Design Tools. You manage them with SQL 
Server Management Studio or SQL Server Management 
Studio Express. 


How do you program SQL Server 
Compact Edition applications? 

To program SQL Server Compact Edition applications, 
you can use VS 2008 (including the Express editions) and 
the NET Framework 3.5. For data access middleware, you 
can use LINQ, ADO.NET, and OLE DB. 


Can it integrate with other 
versions of SQL Server? 

Yes. You can connect it with SQL Server 2008 or SQL 
Server 2005 using the new Synchronization Services for 
ADO.NET, merge replication, or Remote Data Access. 


Framework 3.5, Microsoft Internet Explorer (IE) 


6.0, ActiveSync 4.0, and VS 


hen it comes to new product versions and 
releases, Michael Otey is pretty darn hip. He's 
so knowledgeable about upcoming improvements 
and new features. | am always amazed at how far 
in front of the curve he ¡s. As you can probably tell 
by my less-than-chic yellow tu outdated 
hairstyle, | am none of those things. It takes me a 
„and I'm more inclined to stick 
with whatever works—at least until it breaks. So | 
understand if you (or whoever makes your organiza- 
tion’s upgrade decisions) aren’t convinced to upgrade 
| to Visual Studio 20 ile Michael flies on ahead 
of me to trendy-town, I'm dawdling behind so that | 
can provide some Visual Studio 2005 resources that 
might be right up your alley. 
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